This guided project was interesting to say the least (read: difficult).

It’s the first project which I have completed where I can see how the code can be further optimised. I’ll set a mini-project for myself to come back to this one for a re-write.

For starters, I think I’ll create a few user defined functions that I know will come in handy for re-use on future projects (i.e. file import and key/value swap and reverse sort)

Hacker News Project

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as “posts”) receive votes and comments, similar to reddit.

Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

We’ll compare the two types of posts on Hacker News to determine the following:

Do Ask HN or Show HN receive more comments on average?
Do posts created at a certain time receive more comments on average?

Ask HN posts are submitted by users to ask the Hacker News community a specific question.

Show HN posts are submitted by users to show the Hacker News community a project, product, or anything that may be interesting.

Import the Hacker News file so we can begin

In [26]:

from csv import reader
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
hn[:5]

Out[26]:

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [27]:

headers = hn[:1] # Extract header and assign to headers
hn = hn[1:] # Removes the first row from the header table

print(headers) # Checking the headers were extracted properly

hn[:5] # Displays the first 5 rows of 'hn' to ensure the header row was removed properly

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]

Out[27]:

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

In [28]:

# Create 3 empty lists
ask_posts = []
show_posts = []
other_posts = []

# Loop through each rown in hn
for row in hn:
    title = row[1]
    
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
        
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    
    else:
        other_posts.append(row)

# check the length of the appended lists
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

# Print the first 5 rows of the "Ask Posts" list.
ask_posts[:5]

1744
1162
17194

Out[28]:

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14'],
 ['12210105',
  'Ask HN: Looking for Employee #3 How do I do it?',
  '',
  '1',
  '3',
  'sph130',
  '8/2/2016 14:20'],
 ['10394168',
  'Ask HN: Someone offered to buy my browser extension from me. What now?',
  '',
  '28',
  '17',
  'roykolak',
  '10/15/2015 16:38']]

In [29]:

total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)


print("The average number of comments on ask posts is:", avg_ask_comments)

The average number of comments on ask posts is: 14.038417431192661

In [30]:

total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)


print("The average number of comments on show posts is:", avg_show_comments)

The average number of comments on show posts is: 10.31669535283993

On average, ask posts receive more comments than show posts with 14.03 comments per post.

Now let’s take a look at the number of Ask Posts and Comments by Hour Created

In [19]:

import datetime as dt

result_list = [] # This will be an empty list of lists that we'll append to

for row in ask_posts:
    result_list.append([row[6], int(row[4])])
    
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comment = row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H") # Extracts the hour from the date
    if time in counts_by_hour:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment
    else:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment

In [8]:

comments_by_hour # This dictionary displays hour:number of comments

Out[8]:

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

The below calculation will return the average number of comments per hour

In [9]:

avg_by_hour = []

for avg_comm in comments_by_hour:
    avg_by_hour.append([avg_comm, comments_by_hour[avg_comm] / counts_by_hour[avg_comm]])

avg_by_hour

Out[9]:

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

For readability, let’s swap the key and value. This will allow for easy sorting high to low.

In [10]:

swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

swap_avg_by_hour

Out[10]:

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

In [11]:

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

In [12]:

sorted_swap

Out[12]:

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]

In [25]:

print("Top 5 Hours for Ask Posts Comments:")

for avg, hr in sorted_swap[:5]:
     print("\n{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"),avg))

Top 5 Hours for Ask Posts Comments:

15:00: 38.59 average comments per post

02:00: 23.81 average comments per post

20:00: 21.52 average comments per post

16:00: 16.80 average comments per post

21:00: 16.01 average comments per post

In Conclusion:

The best types of posts to receive comments are Ask Posts and the best time to post is 3 PM.

The timing may be due to people winding down from work and/or taking their last break of the day before going home allowing them to take a quick browse of any new posts where they can potentially respond.

The average amount of comments an Ask Post receives at 3 PM is 38.59 comments while the second best time to post is 2 AM with an average of 23.81 comments.

The Open Project Lab

Guided Project #3: Hacker News