This guided project was interesting, to say the least (read: difficult).
It’s the first project I have completed where I can see how the code could be further optimised. I’ll set myself a mini-project to come back and rewrite it.
For starters, I think I’ll create a few user-defined functions that I know will come in handy for reuse on future projects (e.g. file import, key/value swap, and reverse sort).
Hacker News Project
Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as “posts”) receive votes and comments, similar to reddit.
Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.
We’ll compare the two types of posts on Hacker News to determine the following:
- Do Ask HN or Show HN receive more comments on average?
- Do posts created at a certain time receive more comments on average?
Ask HN posts are submitted by users to ask the Hacker News community a specific question.
Show HN posts are submitted by users to show the Hacker News community a project, product, or anything that may be interesting.
Import the Hacker News file so we can begin
In [26]:
from csv import reader

opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
hn[:5]
Out[26]:
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
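As a first step towards the reusable file-import function mentioned at the top, here is a minimal sketch. The names `split_header` and `load_csv` are hypothetical helpers of my own, and the UTF-8 default is an assumption about the file, not something the notebook confirms.

```python
import csv


def split_header(lines):
    """Parse CSV text lines and return (header, data_rows).

    `lines` can be an open file or any iterable of strings.
    """
    rows = list(csv.reader(lines))
    if not rows:
        raise ValueError("no rows found")
    return rows[0], rows[1:]


def load_csv(path, encoding="utf-8"):
    """Open `path`, parse it, and return (header, data_rows).

    The encoding default is an assumption; change it if the file differs.
    """
    # The `with` block closes the file even if parsing raises an error.
    # newline="" is what the csv module documentation recommends for readers.
    with open(path, newline="", encoding=encoding) as f:
        return split_header(f)
```

With this in place, the import cell could become `headers, hn = load_csv('hacker_news.csv')`, which also takes care of separating the header row.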
In [27]:
headers = hn[:1]  # Extract the header row (as a one-item list) and assign it to headers
hn = hn[1:]       # Remove the header row from the data
print(headers)    # Check that the header was extracted properly
hn[:5]            # Display the first 5 rows of 'hn' to confirm the header row was removed
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
Out[27]:
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
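Notice that `headers` printed as a list inside a list. That is because a slice always returns a list, even when it holds one item. The small sketch below, using a made-up two-row table, shows the difference between slicing and indexing, and how the flat header could be used to look up a column position instead of hard-coding it.

```python
# A made-up two-row table standing in for the real data.
table = [["id", "title", "num_comments"], ["1", "Ask HN: example", "3"]]

as_slice = table[:1]  # a list containing the header row
as_item = table[0]    # the header row itself

# With the flat header, a column position can be looked up by name.
comments_col = as_item.index("num_comments")

print(as_slice)
print(as_item)
print(comments_col)
```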
In [28]:
# Create 3 empty lists
ask_posts = []
show_posts = []
other_posts = []

# Loop through each row in hn
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Check the length of each list
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

# Display the first 5 rows of the "Ask Posts" list
ask_posts[:5]
1744
1162
17194
Out[28]:
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]
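The same split could be wrapped in a function so the title check lives in one place. This is only a sketch: `post_type` and `split_posts` are hypothetical names, and the third sample row is made up to show that the match ignores case.

```python
def post_type(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


def split_posts(rows, title_index=1):
    """Group rows into a dict of lists keyed by post type."""
    groups = {"ask": [], "show": [], "other": []}
    for row in rows:
        groups[post_type(row[title_index])].append(row)
    return groups


sample_rows = [
    ["12296411", "Ask HN: How to improve my personal website?"],
    ["12224879", "Interactive Dynamic Video"],
    ["0", "SHOW HN: a made-up title"],  # hypothetical row, not from the dataset
]
groups = split_posts(sample_rows)
print({name: len(rows) for name, rows in groups.items()})
```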
In [29]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print("The average number of comments on ask posts is:", avg_ask_comments)
The average number of comments on ask posts is: 14.038417431192661
In [30]:
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)
print("The average number of comments on show posts is:", avg_show_comments)
The average number of comments on show posts is: 10.31669535283993
On average, ask posts receive more comments than show posts: about 14.04 comments per ask post compared with about 10.32 per show post.
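The two averaging cells are identical apart from the list they loop over, so this is an obvious candidate for a helper. A minimal sketch, assuming the comment count sits at index 4 as it does in this dataset (`average_comments` is a hypothetical name):

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments across `posts`."""
    if not posts:
        raise ValueError("cannot average an empty list of posts")
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# The first two Ask HN rows shown earlier in the notebook.
sample_posts = [
    ["12296411", "Ask HN: How to improve my personal website?", "", "2", "6",
     "ahmedbaracat", "8/16/2016 9:55"],
    ["10610020",
     "Ask HN: Am I the only one outraged by Twitter shutting down share counts?",
     "", "28", "29", "tkfx", "11/22/2015 13:43"],
]
print(average_comments(sample_posts))  # (6 + 29) / 2
```

Calling `average_comments(ask_posts)` and `average_comments(show_posts)` would then replace both cells.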
Now let’s take a look at the number of Ask Posts and Comments by Hour Created
In [19]:
import datetime as dt

result_list = []  # This will be an empty list of lists that we'll append to
for row in ask_posts:
    result_list.append([row[6], int(row[4])])

counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comment = row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")  # Extracts the hour from the date
    if time in counts_by_hour:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment
    else:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment
In [8]:
comments_by_hour # This dictionary displays hour:number of comments
Out[8]:
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
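One way the tally could be shortened on a rewrite is with `collections.defaultdict`, which removes the need for the `if`/`else` branch. This is a sketch rather than a drop-in replacement: `tally_by_hour` is a hypothetical name and the sample rows are made up (only the columns the function reads are filled in).

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"


def tally_by_hour(posts, comments_index=4, created_index=6):
    """Return (post counts, comment totals), each keyed by two-digit hour."""
    counts = defaultdict(int)    # missing hours start at 0
    comments = defaultdict(int)
    for row in posts:
        created = dt.datetime.strptime(row[created_index], DATE_FORMAT)
        hour = created.strftime("%H")
        counts[hour] += 1
        comments[hour] += int(row[comments_index])
    return dict(counts), dict(comments)


# Made-up rows: two posts in the 09 hour, one in the 13 hour.
sample_posts = [
    ["", "", "", "", "6", "", "8/16/2016 9:55"],
    ["", "", "", "", "4", "", "8/17/2016 9:05"],
    ["", "", "", "", "29", "", "11/22/2015 13:43"],
]
counts, comments = tally_by_hour(sample_posts)
print(counts)
print(comments)
```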
The calculation below returns the average number of comments per Ask post for each hour.
In [9]:
avg_by_hour = []
for avg_comm in comments_by_hour:
    avg_by_hour.append([avg_comm, comments_by_hour[avg_comm] / counts_by_hour[avg_comm]])
avg_by_hour
Out[9]:
[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
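The same averages could also be kept in a dictionary using a comprehension. The sketch below uses three hours only. The comment totals are copied from the `comments_by_hour` output above, but the post counts are never printed in this notebook, so the ones here are back-calculated from the printed averages and should be read as assumptions.

```python
# Totals copied from the comments_by_hour output (three hours only).
comments_by_hour = {"17": 1146, "20": 1722, "08": 492}

# Assumed post counts, back-calculated from the printed averages.
counts_by_hour = {"17": 100, "20": 80, "08": 48}

# Average comments per post for each hour.
avg_by_hour = {
    hour: comments_by_hour[hour] / counts_by_hour[hour]
    for hour in comments_by_hour
}
print(avg_by_hour)
```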
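For readability, let’s swap the two elements of each pair so the average comes first. Lists are compared element by element, so sorting the swapped list in reverse order ranks the hours from the highest average to the lowest.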
In [10]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
swap_avg_by_hour
Out[10]:
[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]
In [11]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
In [12]:
sorted_swap
Out[12]:
[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
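The swap step may not be needed at all on a rewrite, because `sorted` accepts a `key` argument that says which element to sort on. A minimal sketch using four of the hour/average pairs from the output above:

```python
from operator import itemgetter

# Four [hour, average] pairs copied from the avg_by_hour output.
avg_by_hour = [
    ["09", 5.5777777777777775],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
    ["20", 21.525],
]

# Sort on the second element (the average), highest first, without swapping.
ranked = sorted(avg_by_hour, key=itemgetter(1), reverse=True)

for hour, avg in ranked:
    print("{}:00 {:.2f}".format(hour, avg))
```

Each pair keeps its original `[hour, average]` order, so any later loop would unpack it as `for hr, avg in ranked`.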
In [25]:
print("Top 5 Hours for Ask Posts Comments:")
for avg, hr in sorted_swap[:5]:
    print("\n{}: {:.2f} average comments per post".format(
        dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg))
Top 5 Hours for Ask Posts Comments:
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
In Conclusion:
Ask posts receive more comments on average than Show posts, and among Ask posts the best hour to post is 3 PM (15:00 in the dataset’s timestamps).
The timing may be due to people winding down from work or taking their last break of the day before going home, which gives them time to browse new posts and respond.
Ask posts created at 3 PM receive 38.59 comments on average, while the second-best time is 2 AM, with an average of 23.81 comments.