Understanding the data

In practice, Walmart runs promotional markdown events around the major holidays of the year. The four largest are the Super Bowl (held on the first Sunday of February in the years covered by this data; it is a major sporting event, and many Americans treat Super Bowl Sunday as an unofficial national holiday, per https://vi.wikipedia.org/wiki/Super_Bowl), Labor Day (the first Monday of September in the United States), Thanksgiving (the fourth Thursday of November in the United States; in Canada it falls on the second Monday of October, per https://en.wikipedia.org/wiki/Thanksgiving), and Christmas (December 25, with Christmas Eve on December 24, per https://en.wikipedia.org/wiki/Christmas). Weeks containing these holidays are weighted five times higher than other weeks in the evaluation. We have to build a model that captures the effect of markdowns in these holiday weeks without complete historical data.
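To make the "five times" weighting concrete, here is a minimal sketch of a weighted mean absolute error, which is how I understand the competition's metric. The function name `weighted_mae` and the toy numbers are my own assumptions, not something taken from the competition files.

```python
def weighted_mae(y_true, y_pred, is_holiday, holiday_weight=5.0):
    """Weighted mean absolute error: holiday weeks count holiday_weight times."""
    weights = [holiday_weight if flag else 1.0 for flag in is_holiday]
    total_error = sum(w * abs(t - p) for w, t, p in zip(weights, y_true, y_pred))
    return total_error / sum(weights)


# Toy example (made-up numbers): three ordinary weeks and one holiday week
actual = [100.0, 120.0, 90.0, 300.0]
predicted = [110.0, 115.0, 95.0, 250.0]
holiday_flags = [False, False, False, True]

score = weighted_mae(actual, predicted, holiday_flags)
print(score)
```

With these numbers the plain mean absolute error would be 17.5, while the weighted version is 33.75, so a miss on a holiday week hurts much more than a miss on a normal week.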

The provided dataset consists of:

Train set: weekly sales from 2010-02-05 to 2012-11-01. The fields are: Store - the store number, Dept - the department number, Date - the week, Weekly_Sales - the sales for that week, IsHoliday - True if the week contains one of the holidays, False otherwise.

Test set: the rows to predict, with the same columns as the train set except Weekly_Sales.

Features: additional information about each store, namely Store - the store number, Date - the week, Temperature - the temperature, Fuel_Price - the fuel price (in the US, fuel prices differ from region to region), MarkDown1, MarkDown2, …, MarkDown5 - anonymized markdown figures whose exact meaning the data provider does not define, CPI - the consumer price index, Unemployment - the unemployment rate, IsHoliday - whether the week contains a holiday.

Data analysis

First, I import the libraries we need.

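import pandas as pd
import numpy as np
from datetime import datetime, timedelta  # used by the date features further down

# Do some statistics
from scipy import sparse
import scipy.stats as ss
import math

# Nice graphing tools
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Modelling tools used in the training section
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras import regularizers
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor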

Read the data files and merge them together.

train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
feature = pd.read_csv('data/features.csv')

# Both files carry IsHoliday, so the merge yields IsHoliday_x and IsHoliday_y
train = train.merge(feature, how='left', on=['Store', 'Date'])
test = test.merge(feature, how='left', on=['Store', 'Date'])

# Merge in store info
stores = pd.read_csv("data/stores.csv")
train = train.merge(stores, how='left', on='Store')
test = test.merge(stores, how='left', on='Store')

# Tag each row with its origin so the two sets can be separated again later
train['Split'] = 'Train'
test['Split'] = 'Test'
print(train.head())

Result

   Store  Dept        Date  Weekly_Sales  IsHoliday_x  Temperature  Fuel_Price  MarkDown1  MarkDown2  MarkDown3  MarkDown4  MarkDown5         CPI  Unemployment  IsHoliday_y Type    Size  Split
0      1     1  2010-02-05      24924.50        False        42.31       2.572        NaN        NaN        NaN        NaN        NaN  211.096358         8.106        False    A  151315  Train
1      1     1  2010-02-12      46039.49         True        38.51       2.548        NaN        NaN        NaN        NaN        NaN  211.242170         8.106         True    A  151315  Train
2      1     1  2010-02-19      41595.55        False        39.93       2.514        NaN        NaN        NaN        NaN        NaN  211.289143         8.106        False    A  151315  Train
3      1     1  2010-02-26      19403.54        False        46.63       2.561        NaN        NaN        NaN        NaN        NaN  211.319643         8.106        False    A  151315  Train
4      1     1  2010-03-05      21827.90        False        46.50       2.625        NaN        NaN        NaN        NaN        NaN  211.350143         8.106        False    A  151315  Train

These are only the first 5 rows, and the MarkDown columns are already NaN.

Let's run some analysis. I will concatenate the train and test sets first and then compute summary statistics.

df = pd.concat([train, test], axis=0)  # Join train and test

print(df.describe())

Result

                 CPI           Dept     Fuel_Price      MarkDown1      MarkDown2      MarkDown3      MarkDown4      MarkDown5          Size          Store    Temperature   Unemployment   Weekly_Sales
count  498472.000000  536634.000000  536634.000000  265596.000000  197685.000000  242326.000000  237143.000000  266496.000000  536634.00000  536634.000000  536634.000000  498472.000000  421570.000000
mean      172.090481      44.277301       3.408310    7438.004144    3509.274827    1857.913525    3371.556866    4324.021158  136678.55096      22.208621      58.771762       7.791888   15981.258123
std        39.542149      30.527358       0.430861    9411.341379    8992.047197   11616.143274    6872.281734   13549.262124   61007.71180      12.790580      18.678716       1.865076   22711.183519
min       126.064000       1.000000       2.472000   -2781.450000    -265.760000    -179.260000       0.220000    -185.170000   34875.00000       1.000000      -7.290000       3.684000   -4988.940000
25%       132.521867      18.000000       3.041000    2114.640000      72.500000       7.220000     336.240000    1570.112500   93638.00000      11.000000      45.250000       6.623000    2079.650000
50%       182.442420      37.000000       3.523000    5126.540000     385.310000      40.760000    1239.040000    2870.910000  140167.00000      22.000000      60.060000       7.795000    7612.030000
75%       213.748126      74.000000       3.744000    9303.850000    2392.390000     174.260000    3397.080000    5012.220000  202505.00000      33.000000      73.230000       8.549000   20205.852500
max       228.976456      99.000000       4.468000  103184.980000  104519.540000  149483.310000   67474.850000  771448.100000  219622.00000      45.000000     101.950000      14.313000  693099.360000

A few observations:

The Dept and Store columns can be skipped here: they are department and store identifiers, so their numeric values are arbitrary labels.

The MarkDown columns have a fairly large standard deviation.

Temperature ranges from -7.29 to 101.95 with a mean of about 58, so it cannot be in degrees Celsius; it is most likely in degrees Fahrenheit.
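As a quick sanity check of the Fahrenheit reading, the three summary values can be converted to Celsius. This is only a small sketch; the helper name `fahrenheit_to_celsius` is mine.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


# min, mean and max of the Temperature column, taken from the describe() table
for label, temp_f in [("min", -7.29), ("mean", 58.77), ("max", 101.95)]:
    print(label, round(fahrenheit_to_celsius(temp_f), 1))
```

Read as Fahrenheit, the column spans roughly -21.8 to 38.9 degrees Celsius with a mean near 14.9, which is a believable range for US weather.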

Next, let's look at the correlation coefficients between the columns.

sns.set(style="white")

# Compute the correlation matrix over the numeric columns only
corr = df.corr(numeric_only=True)

# Generate a mask for the upper triangle (np.bool was removed from NumPy, plain bool works)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

plt.show()

Figure: Correlation coefficients between the columns of the data

Some observations: MarkDown5 is almost unrelated to the other columns. The color scale runs from -0.3 to 0.3 (the heatmap is drawn with vmax=.3, so any stronger correlation is simply shown at the end of the scale), and most relationships between the columns are fairly weak. The consumer price index is negatively correlated with unemployment (is that plausible?). Larger stores sell more (no surprise), and departments with larger numbers sell more (? perhaps higher numbers mean newer product lines, and shoppers like new products). Importantly, fuel price, IsHoliday and temperature show no correlation with Weekly_Sales, and CPI and unemployment have only a small effect on it.

Let's plot weekly sales against store size to see what it looks like.

plt.scatter(df['Size'], df['Weekly_Sales'])
plt.show()

Figure: Weekly sales versus store size

The plot shows that small stores do not see sales spikes on holidays, the very largest stores rarely spike, and the mid-sized stores do spike: around size 125000 there is a point with sales near 700000. Let's see which dates the largest sales fall on. From the describe() table above, the mean of Weekly_Sales is 15981 and the standard deviation is 22711, which add up to 15981 + 22711 = 38692; on the plot, the spikes sit far above that. The maximum is about 700000 and the minimum about 0 (read off the plot, not the exact values in the summary table). I will pull out the rows with sales above 350000, which is far beyond the mean plus one standard deviation and therefore clearly an outlier, and check what dates they are.

print(df.loc[df['Weekly_Sales'] > 350000].head(10))

Printing the first 10 rows:

               CPI        Date  Dept  Fuel_Price  IsHoliday_x  IsHoliday_y  MarkDown1  MarkDown2  MarkDown3  MarkDown4  MarkDown5    Size  Split  Store  Temperature Type  Unemployment  Weekly_Sales
37201   126.669267  2010-11-26    72       2.752         True         True        NaN        NaN        NaN        NaN        NaN  205863  Train      4        48.08    A         7.127     381072.11
37253   129.836400  2011-11-25    72       3.225         True         True     561.45     137.88   83340.33      44.04    9239.23  205863  Train      4        47.96    A         5.143     385051.04
88428   126.983581  2010-12-24     7       3.236        False        False        NaN        NaN        NaN        NaN        NaN  126512  Train     10        57.06    B         9.003     406988.63
95373   126.669267  2010-11-26    72       3.162         True         True        NaN        NaN        NaN        NaN        NaN  126512  Train     10        55.33    B         9.003     693099.36
95377   126.983581  2010-12-24    72       3.236        False        False        NaN        NaN        NaN        NaN        NaN  126512  Train     10        57.06    B         9.003     404245.03
95425   129.836400  2011-11-25    72       3.760         True         True     174.72     329.00  141630.61      79.00    1009.98  126512  Train     10        60.68    B         7.874     630999.19
115222  126.669267  2010-11-26    72       3.162         True         True        NaN        NaN        NaN        NaN        NaN  112238  Train     12        47.66    B        14.313     359995.60
115274  129.836400  2011-11-25    72       3.622         True         True    5391.83       8.00   63143.29      49.27    2115.67  112238  Train     12        53.25    B        12.890     360140.66
128984  182.544590  2010-12-24     7       3.141        False        False        NaN        NaN        NaN        NaN        NaN  200898  Train     14        30.59    A         8.724     356867.25
135665  182.783277  2010-11-26    72       3.039         True         True        NaN        NaN        NaN        NaN        NaN  200898  Train     14        46.15    A         8.724     474330.10

In the table above, the first 10 rows are concentrated in November and December. The December rows belong to the week of December 24-25, that is, Christmas. The November rows are November 25-26, which the holiday description does not explain. The calendar shows that 2011-11-25 was a Friday, and a quick search turns up a key piece of information: "Black Friday falls between November 23 and 29", so this is very likely it. Checking 2010-11-26, it was a Friday as well. Black Friday and Christmas clearly trigger a buying frenzy.
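Instead of looking the dates up by hand, the check can be sketched in a few lines: Black Friday is the day after US Thanksgiving, which is the fourth Thursday of November. The function name `black_friday` is my own.

```python
from datetime import date, timedelta


def black_friday(year):
    """Return the Friday after US Thanksgiving (fourth Thursday of November)."""
    first_of_november = date(year, 11, 1)
    # weekday(): Monday is 0, Thursday is 3
    days_to_first_thursday = (3 - first_of_november.weekday()) % 7
    thanksgiving = first_of_november + timedelta(days=days_to_first_thursday + 21)
    return thanksgiving + timedelta(days=1)


for year in (2010, 2011, 2012):
    print(year, black_friday(year))
```

For 2010 and 2011 this gives November 26 and November 25, the same dates that show up in the table.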

I used a small trick: lower the sales threshold step by step to find the smallest value at which the Black Friday and Christmas weeks still dominate. The method is simple: start from 350000, subtract 10000 each round, and count how often each date appears; as soon as a date outside the Black Friday and Christmas weeks shows up, stop. After a short search the threshold turned out to be 290000.

print(df.loc[df['Weekly_Sales'] > 290000, "Date"].value_counts())

2010-11-26    16
2011-11-25    14
2010-12-24     8
2011-12-23     3
2010-02-05     1
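The threshold search described above was done by hand; a minimal sketch of it as a loop could look like the following. The function name `first_threshold_with_outsider` and the toy data are assumptions for illustration only, not the competition data.

```python
import pandas as pd


def first_threshold_with_outsider(sales, peak_dates, start=350000, step=10000):
    """Lower the threshold from `start` by `step` each round and return the first
    threshold at which a week outside `peak_dates` exceeds it (None if never)."""
    threshold = start
    while threshold > 0:
        dates_above = set(sales.loc[sales["Weekly_Sales"] > threshold, "Date"])
        if dates_above - set(peak_dates):
            return threshold
        threshold -= step
    return None


# Toy data (made up for illustration)
toy = pd.DataFrame({
    "Date": ["2010-11-26", "2011-11-25", "2010-12-24", "2010-02-05", "2010-03-05"],
    "Weekly_Sales": [693000.0, 630000.0, 406000.0, 295000.0, 21000.0],
})
peak_weeks = ["2010-11-26", "2011-11-25", "2010-12-24", "2011-12-23"]

stop_at = first_threshold_with_outsider(toy, peak_weeks)
print(stop_at)
```

On the toy data the loop stops at 290000, the first threshold at which the 2010-02-05 week sneaks in next to the Black Friday and Christmas weeks.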

Data cleaning

Handling missing values

An important issue is that this dataset has a lot of missing values. Counting the nulls in each column gives:

CPI              38162
Date                 0
Dept                 0
Fuel_Price           0
IsHoliday_x          0
IsHoliday_y          0
MarkDown1       271038
MarkDown2       338949
MarkDown3       294308
MarkDown4       299491
MarkDown5       270138
Size                 0
Split                0
Store                0
Temperature          0
Type                 0
Unemployment     38162
Weekly_Sales    115064
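The post does not show the call that produced these counts. One way to get them, sketched here on a tiny made-up frame rather than the real data, is `isnull().sum()`; taking `isnull().mean()` additionally gives the share of missing rows per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the concatenated train/test data (values are illustrative)
toy = pd.DataFrame({
    "MarkDown1": [np.nan, 561.45, np.nan, 174.72],
    "CPI": [211.1, np.nan, 211.3, 211.4],
    "Fuel_Price": [2.572, 2.548, 2.514, 2.561],
})

null_counts = toy.isnull().sum()   # number of missing values per column
null_share = toy.isnull().mean()   # fraction of missing values per column
print(null_counts)
print(null_share)
```

On the real data the same two calls would be applied to `df`. The share is often easier to read than the raw count: in the listing above, each MarkDown column is missing in at least half of the 536634 rows.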

The MarkDown columns are null very often. The simplest fix is to set every null to 0 (I first record which rows actually had each MarkDown value, so that information is not lost).

# Remember where each MarkDown value was present before the nulls are overwritten
df = df.assign(md1_present=df['MarkDown1'].notnull())
df = df.assign(md2_present=df['MarkDown2'].notnull())
df = df.assign(md3_present=df['MarkDown3'].notnull())
df = df.assign(md4_present=df['MarkDown4'].notnull())
df = df.assign(md5_present=df['MarkDown5'].notnull())

df.fillna(0, inplace=True)

Feature engineering

Holiday feature

df['IsHoliday'] = 'IsHoliday_' + df['IsHoliday_x'].map(str)
holiday_dummies = pd.get_dummies(df['IsHoliday'])

Date features

Extracting the month

# Parse the Date strings into datetime.date objects
df['DateType'] = [datetime.strptime(date, '%Y-%m-%d').date() for date in df['Date'].astype(str)]
df['Month'] = [date.month for date in df['DateType']]
df['Month'] = 'Month_' + df['Month'].map(str)
Month_dummies = pd.get_dummies(df['Month'])

Flagging the pre-Christmas and Black Friday weeks

black_friday_dates = [datetime(2010, 11, 26).date(), datetime(2011, 11, 25).date()]
pre_christmas_dates = [datetime(2010, 12, 23).date(), datetime(2010, 12, 24).date(),
                       datetime(2011, 12, 23).date(), datetime(2011, 12, 24).date()]
df['Black_Friday'] = np.where(df['DateType'].isin(black_friday_dates), 'yes', 'no')
df['Pre_christmas'] = np.where(df['DateType'].isin(pre_christmas_dates), 'yes', 'no')
df['Black_Friday'] = 'Black_Friday_' + df['Black_Friday'].map(str)
df['Pre_christmas'] = 'Pre_christmas_' + df['Pre_christmas'].map(str)
Black_Friday_dummies = pd.get_dummies(df['Black_Friday'])
Pre_christmas_dummies = pd.get_dummies(df['Pre_christmas'])

Adding the features to the data

df = pd.concat([df, holiday_dummies, Pre_christmas_dummies, Black_Friday_dummies], axis=1)

Next comes a median-sales feature, computed per store type, department, store, month and holiday flag. Some stores have no sales value at certain points in time, and replacing those with 0 does not seem reasonable. I chose to substitute the median of the sales in the same month for the same group instead. First, though, we need to compute those medians.

medians = pd.DataFrame({'Median Sales': df.loc[df['Split'] == 'Train'].groupby(by=['Type', 'Dept', 'Store', 'Month', 'IsHoliday'])['Weekly_Sales'].median()}).reset_index()
print(medians.head())

Result (the Type_, Dept_ and Store_ prefixes come from a relabelling step that is not shown in this post)

     Type    Dept    Store     Month        IsHoliday  Median Sales
0  Type_A  Dept_1  Store_1   Month_1  IsHoliday_False     17350.585
1  Type_A  Dept_1  Store_1  Month_10  IsHoliday_False     23388.030
2  Type_A  Dept_1  Store_1  Month_11  IsHoliday_False     19551.115
3  Type_A  Dept_1  Store_1  Month_11   IsHoliday_True     19865.770
4  Type_A  Dept_1  Store_1  Month_12  IsHoliday_False     39109.390

Merge the medians into the main data, fill the NAs, and create a key for each row for easy lookup.

df = df.merge(medians, how='outer', on=['Type', 'Dept', 'Store', 'Month', 'IsHoliday'])

# Fill NA with the overall median of the train rows
# (assigning the result back avoids an in-place fillna on a column selection)
df['Median Sales'] = df['Median Sales'].fillna(df.loc[df['Split'] == 'Train', 'Median Sales'].median())

# Create a key for easy access
df['Key'] = df['Type'].map(str) + df['Dept'].map(str) + df['Store'].map(str) + df['Date'].map(str) + df['IsHoliday'].map(str)

We will predict a week's sales from the sales of the week before it, so each row also stores the date of the previous week for easy lookup. A week has 7 days, so the stored value is the row's date minus 7 days.

df['DateLagged'] = df['DateType'] - timedelta(days=7)

Now we loop over every row of the dataset and check whether the previous week's sales are available. If they are not, we substitute the median computed above. I build a sorted copy of the data here so the lookup is fast.

# Make a sorted dataframe. This will allow us to find lagged variables much faster!
sorted_df = df.sort_values(['Store', 'Dept', 'DateType'])
sorted_df = sorted_df.reset_index(drop=True)  # Reinitialize the row indices for the loop to work

sorted_df['LaggedSales'] = np.nan  # Initialize column
sorted_df['LaggedAvailable'] = np.nan  # Initialize column
last = sorted_df.iloc[0]  # Initialize last row for first iteration. Doesn't really matter what it is
row_len = sorted_df.shape[0]
for index, row in sorted_df.iterrows():
    lag_date = row["DateLagged"]
    # Use last week's sales only if the previous row is exactly one week earlier
    # and its weekly sales are positive
    if (last['DateType'] == lag_date) and (last['Weekly_Sales'] > 0):
        # DataFrame.set_value was removed from pandas; .at is the replacement
        sorted_df.at[index, 'LaggedSales'] = last['Weekly_Sales']
        sorted_df.at[index, 'LaggedAvailable'] = 1
    else:
        sorted_df.at[index, 'LaggedSales'] = row['Median Sales']  # Fill with median
        sorted_df.at[index, 'LaggedAvailable'] = 0

    last = row  # Remember last row for speed
    if index % int(row_len / 10) == 0:  # See progress by printing every 10% interval
        print(str(int(index * 100 / row_len)) + '% loaded')

print(sorted_df[['Dept', 'Store', 'DateType', 'LaggedSales', 'Weekly_Sales', 'Median Sales']].head())

9% loaded
19% loaded
29% loaded
39% loaded
49% loaded
59% loaded
69% loaded
79% loaded
89% loaded
99% loaded
     Dept    Store    DateType  LaggedSales  Weekly_Sales  Median Sales
0  Dept_1  Store_1  2010-02-05     23510.49      24924.50      23510.49
1  Dept_1  Store_1  2010-02-12     24924.50      46039.49      37887.17
2  Dept_1  Store_1  2010-02-19     46039.49      41595.55      23510.49
3  Dept_1  Store_1  2010-02-26     41595.55      19403.54      23510.49
4  Dept_1  Store_1  2010-03-05     19403.54      21827.90      21280.40
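Looping over several hundred thousand rows with iterrows is slow. As an alternative, here is a rough sketch of the same idea with a grouped shift. It is not the code used for the results in this post: it assumes the date is stored as a real datetime column called Date, the function name `add_lagged_sales` is mine, and unlike the loop it never looks across the boundary between two store/department groups.

```python
import numpy as np
import pandas as pd


def add_lagged_sales(frame):
    """Add LaggedSales / LaggedAvailable using a per-(Store, Dept) shift."""
    out = frame.sort_values(["Store", "Dept", "Date"]).reset_index(drop=True)
    grouped = out.groupby(["Store", "Dept"])
    prev_sales = grouped["Weekly_Sales"].shift(1)
    prev_date = grouped["Date"].shift(1)
    # The previous row only counts if it is exactly one week earlier and has positive sales
    available = (prev_date == out["Date"] - pd.Timedelta(days=7)) & (prev_sales > 0)
    out["LaggedSales"] = np.where(available, prev_sales, out["Median Sales"])
    out["LaggedAvailable"] = available.astype(int)
    return out


# Toy data (made up): one department with a missing week, plus a second department
toy = pd.DataFrame({
    "Store": [1, 1, 1, 1],
    "Dept": [1, 1, 1, 2],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12", "2010-02-26", "2010-02-05"]),
    "Weekly_Sales": [100.0, 200.0, 300.0, 50.0],
    "Median Sales": [150.0, 150.0, 150.0, 40.0],
})
result = add_lagged_sales(toy)
print(result[["Store", "Dept", "Date", "LaggedSales", "LaggedAvailable"]])
```

Only the 2010-02-12 row has a usable previous week here, so it gets the lagged value 100.0; the other rows fall back to their median.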

The next step is simple: merge the lagged values into the main data and compute the difference between the median sales and the previous week's sales.

# Merge by store and department
df = df.merge(sorted_df[['Dept', 'Store', 'DateType', 'LaggedSales', 'LaggedAvailable']], how='inner', on=['Dept', 'Store', 'DateType'])
df['Sales_dif'] = df['Median Sales'] - df['LaggedSales']

Now, instead of estimating Weekly_Sales directly, we estimate the difference between the median sales and the weekly sales (working with differences like this is one of the ways to make a time series closer to stationary).

df['Difference'] = df['Median Sales'] - df['Weekly_Sales']
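A small worked example may help show why this change of target is harmless for the error metric. The sales and median values below come from the first rows printed earlier; the "model error" numbers are made up.

```python
median_sales = [23510.49, 37887.17, 21280.40]
weekly_sales = [24924.50, 46039.49, 21827.90]

# Training target as defined above: Median Sales - Weekly_Sales
difference = [m - w for m, w in zip(median_sales, weekly_sales)]

# Pretend a model predicts the difference with some error (made-up numbers)
model_error = [100.0, -250.0, 30.0]
predicted_difference = [d + e for d, e in zip(difference, model_error)]

# Map the predicted difference back to the sales scale
predicted_sales = [m - p for m, p in zip(median_sales, predicted_difference)]

mae_on_difference = sum(abs(d - p) for d, p in zip(difference, predicted_difference)) / len(difference)
mae_on_sales = sum(abs(w - s) for w, s in zip(weekly_sales, predicted_sales)) / len(weekly_sales)
print(mae_on_difference, mae_on_sales)
```

Both errors are the same, because the median cancels out when the prediction is mapped back. It also means that predicting a difference of 0 everywhere is the same as predicting the median, which gives a natural baseline to compare a model against.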

Model training

Selecting the training features

selector = [
    #'Month',
    'CPI',
    'Fuel_Price',
    'MarkDown1',
    'MarkDown2',
    'MarkDown3',
    'MarkDown4',
    'MarkDown5',
    'Size',
    'Temperature',
    'Unemployment',

    'md1_present',
    'md2_present',
    'md3_present',
    'md4_present',
    'md5_present',

    'IsHoliday_False',
    'IsHoliday_True',
    'Pre_christmas_no',
    'Pre_christmas_yes',
    'Black_Friday_no',
    'Black_Friday_yes',
    'LaggedSales',
    'Sales_dif',
    'LaggedAvailable'
    ]

Split the data back into train and test.

train = df.loc[df['Split'] == 'Train']
test = df.loc[df['Split'] == 'Test']

Randomly hold out 20% of the train set for validation.

# Set seed for reproducibility
np.random.seed(42)
X_train, X_val, y_train, y_val = train_test_split(train[selector], train['Difference'], test_size=0.2, random_state=42)

Training with a neural network. Note that the model below is a small fully connected network built from Dense layers, not an LSTM.

adam_regularized = Sequential()

# First hidden layer, L2-regularized
adam_regularized.add(Dense(32, activation='relu',
                           input_dim=X_train.shape[1],
                           kernel_regularizer=regularizers.l2(0.01)))

# Second hidden layer, L2-regularized
adam_regularized.add(Dense(16, activation='relu',
                           kernel_regularizer=regularizers.l2(0.01)))

# Output layer: a single linear unit for regression
adam_regularized.add(Dense(1, activation='linear'))

# Set up the Adam optimizer (newer Keras versions name the argument learning_rate, not lr)
adam_optimizer = keras.optimizers.Adam(learning_rate=0.01,
                                       beta_1=0.9,
                                       beta_2=0.999,
                                       epsilon=1e-08)

# Compile the model; the loss is the mean absolute error.
# 'acc' is kept to match the output below, but accuracy is not meaningful for regression.
adam_regularized.compile(optimizer=adam_optimizer,
                         loss='mean_absolute_error',
                         metrics=['acc'])

# Train (the features are cast to float32 because some columns are boolean)
history = adam_regularized.fit(X_train.astype('float32'), y_train,  # Train on training set
                               epochs=10,  # Train for 10 epochs
                               batch_size=2048,  # Batch size
                               verbose=0)  # Suppress Keras output
print('eval', adam_regularized.evaluate(x=X_val.astype('float32'), y=y_val))

# Plot the training loss
plt.plot(history.history['loss'], label='Adam Regularized')
plt.xlabel('Epochs')
plt.ylabel('loss')
plt.legend()
plt.show()

eval:  [1457.0501796214685, 0.002312783168124545]

Figure: Loss on the train set

The training loss drops to about 1450 and then plateaus; it does not go any lower.

The mean absolute error on the validation set is 1457.0501796214685.

Let's try training a random forest.

# Only the non-default settings are passed; the remaining arguments of the
# original call were scikit-learn defaults, and some of them have since been renamed or removed
regr = RandomForestRegressor(n_estimators=20, n_jobs=1, verbose=2)

# Train on data
regr.fit(X_train, y_train)
y_pred_random = regr.predict(X_val)

# Keep the predictions next to the true differences, indexed like the validation rows
val_out = y_val.to_frame()
val_out['Predicted'] = y_pred_random
df_out = pd.merge(train, val_out[['Predicted']], how='left', left_index=True, right_index=True, suffixes=['_True', '_Pred'])
df_out = df_out[~pd.isnull(df_out['Predicted'])]

# Convert the predicted difference back to a sales figure
df_out['prediction'] = df_out['Median Sales'] - df_out['Predicted']

# Mean absolute error of the median baseline and of the forest
print("Medians: " + str(sum(abs(df_out['Difference'])) / df_out.shape[0]))
print("Random Forest: " + str(sum(abs(df_out['Weekly_Sales'] - df_out['prediction'])) / df_out.shape[0]))

Result

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
building tree 1 of 20
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    6.5s remaining:    0.0s
building tree 2 of 20
building tree 3 of 20
building tree 4 of 20
building tree 5 of 20
building tree 6 of 20
building tree 7 of 20
building tree 8 of 20
building tree 9 of 20
building tree 10 of 20
building tree 11 of 20
building tree 12 of 20
building tree 13 of 20
building tree 14 of 20
building tree 15 of 20
building tree 16 of 20
building tree 17 of 20
building tree 18 of 20
building tree 19 of 20
building tree 20 of 20
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:  2.2min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    1.1s finished
Medians: 1545.7406070759525
Random Forest: 1356.4670052620745

The random forest's mean absolute error is 1356, which is lower than the value the neural network produced.

Let's try XGBoost.

param_dist = {'max_depth': 5}

model = XGBRegressor(**param_dist)

# Train on data
model.fit(X_train, y_train)
y_pred_xgb = model.predict(X_val)

# Keep the predictions next to the validation rows
# (pd.DataFrame(...) works whether y_val is still a Series or already a DataFrame)
val_out = pd.DataFrame(y_val).copy()
val_out['Predicted'] = y_pred_xgb
df_out = pd.merge(train, val_out[['Predicted']], how='left', left_index=True, right_index=True, suffixes=['_True', '_Pred'])
df_out = df_out[~pd.isnull(df_out['Predicted'])]

# Convert the predicted difference back to a sales figure
df_out['prediction'] = df_out['Median Sales'] - df_out['Predicted']

print("Medians: " + str(sum(abs(df_out['Difference'])) / df_out.shape[0]))
print("XGB Regressor: " + str(sum(abs(df_out['Weekly_Sales'] - df_out['prediction'])) / df_out.shape[0]))

Result

Medians: 1545.7406070759525
XGB Regressor: 1354.1976755192593

The result is almost the same as the random forest :).

Now I will use the random forest to create the submission file.

# Only the non-default settings are passed; the rest were scikit-learn defaults
rf_model = RandomForestRegressor(n_estimators=80, n_jobs=1, verbose=0)

# Train on the full train set
rf_model.fit(train[selector], train['Difference'])
final_y_prediction = rf_model.predict(test[selector])

# Convert the predicted difference back to a sales figure
testfile = test.reset_index(drop=True)
testfile['prediction'] = testfile['Median Sales'] - final_y_prediction

# Build the id as <store number>_<dept number>_<date>, keeping only the digits of Store and Dept
store_id = testfile['Store'].astype(str).str.replace(r'\D', '', regex=True)
dept_id = testfile['Dept'].astype(str).str.replace(r'\D', '', regex=True)
submission = pd.DataFrame({'id': store_id + '_' + dept_id + '_' + testfile['Date'].map(str),
                           'Weekly_Sales': testfile['prediction']})

submission.to_csv('submission.csv', index=False)

After submitting, the model scored 4455.96312 on the private leaderboard and 4419.17292 on the public leaderboard. That is a rather poor result (ranked roughly in the top 300). Looking back at the model, I found several problems.

The features in features.csv show no correlation with sales, as the analysis above found, so I boldly dropped features.csv entirely and focused on the main file.

Drop the lag features as well, and try forecasting the sales figure directly.

Build a separate model for each store and department instead of one general model applied to all stores. Only stores that are absent from the train set, or departments a store has not sold before (in short, anything missing from the train set), fall back to the model built over all stores.
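The fallback logic in the last point can be sketched as follows. This only illustrates the lookup structure: the "model" here is just a median, which is an assumption of mine and not the model behind the scores in this post; the function names and the toy rows are hypothetical as well.

```python
from collections import defaultdict
from statistics import median


def fit_group_models(rows):
    """rows: iterable of (store, dept, weekly_sales). Returns (per_group, global_model)."""
    by_group = defaultdict(list)
    all_sales = []
    for store, dept, sales in rows:
        by_group[(store, dept)].append(sales)
        all_sales.append(sales)
    # Stand-in "model": the median of each group's history (assumption for illustration)
    per_group = {key: median(values) for key, values in by_group.items()}
    return per_group, median(all_sales)


def predict(per_group, global_model, store, dept):
    """Use the (store, dept) model when it exists, otherwise fall back to the global one."""
    return per_group.get((store, dept), global_model)


# Toy training rows (illustrative values)
train_rows = [
    (1, 1, 24924.50), (1, 1, 46039.49), (1, 1, 41595.55),
    (1, 2, 50605.27), (1, 2, 44682.74),
    (2, 1, 35034.06),
]
per_group, global_model = fit_group_models(train_rows)
print(predict(per_group, global_model, 1, 1))   # seen in training: group model
print(predict(per_group, global_model, 3, 7))   # unseen pair: global fallback
```

Any per-group forecaster could be dropped in where the median is computed; the point is only that unseen store and department pairs still get a prediction.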

The result was 2736 on the private leaderboard and 2657.40087 on the public leaderboard (top 30), which I am still not fully satisfied with.

Thank you for following along. See you in the next posts.
