Data cleaning Timestamp column
Recently I was looking at Cambridge city’s parking ticket dataset. I discovered that its Issue Time column, which records the hour and minute of ticket issuance, was fairly dirty.
Exploring the data, I discovered that there were 4 different types of Issue Time values within this dataset of 2 million rows.
String in UTC format YYYY-MM-DDTHH:MM:SS.FFF eg
1899-12-31T12:18:00.000
(Note: here the year, month and date are dummy values that are not used)String of format HH:MM PP eg
7:28 PM
String of format HH:MM eg
00:00
NaN
values, ie not a number
In order to conduct downstream data analysis, we need to clean this column. My goal was to output the hour and minute in integer format.
# Example test cases
1899-12-31T12:18:00.000 --> [12, 18]
7:28 PM --> [7, 28]
13:15 --> [13, 15]
NaN --> [None, None]
To do so, my strategy is to parse all the different types of strings into a Python datetime object, before returning that object’s hour and minute values.
I used the datetime library’s strptime function to convert strings to a datetime object. Referring to the Format Codes table, I specified the relevant string format rules eg %H:%M %p
for HH:mm AM/PM. Also, I checked if the length of the equaled timestamp_len, which allows me to identify UTC strings (eg 1899-12-31T12:18:00.000
).
timestamp_len = 23
### Version 1
def parse_time_str(val):
if (len(val)==timestamp_len):
# handles case 1
dt = datetime.strptime(val, "%Y-%m-%dT%H:%M:%S.%f")
elif "M" in val:
# handles case 2
dt = datetime.strptime(val, '%H:%M %p')
else:
# handles case 3
dt = datetime.strptime(val, '%H:%M')
return [dt.hour, dt.minute]
print(parse_time_str("1899-12-31T12:18:00.000")) # [12, 18]
print(parse_time_str("7:28 PM")) # [7, 28]
print(parse_time_str("7:28")) # [7, 28]
However, applying the function to my column gave me a TypeError: object of type ‘float’ has no len()
, which prompted me to check for null values.
Using isna() and value_counts() normalized, indeed we see that 0.05% of the data is null. Not a huge issue.
I updated the function to handle null values. Voila.
The apply function does take a while to run..are there speedier ways to do this? Do let me know below!