Easy Data Collection for Continuous Machine Learning Development with Listly | by Cornellius Yudha Wijaya | Jul 2022

A simple no-code data scraping tool for your work

Photo by Hal Gatewood on Unsplash

Data is at the heart of any machine learning project, which is why a lot of effort goes into collecting the data the project will use. Without data, our project cannot start.

One of the most complex parts of a machine learning project is the data collection itself. Defining the dataset, finding where the data lives, and choosing a collection methodology all need to be thought through.

However, data collection has become more manageable as technology has developed. Many companies have invested time and money in developing smarter ways to collect data; Listly is one example.

In this article, I explain how we can continuously collect data with Listly for machine learning development purposes. Let’s get started.

Listly is an easy-to-use web browser extension that you can configure to collect data automatically. The service is based on click-and-scrape, so we don’t need to know much about programming to use it.

We only need the web page we want to extract data from (we can control which parts of the page to capture) and the Listly extension installed. The process is automated, and we quickly get the result as an Excel file.

Let’s try using Listly for our data collection and develop a machine learning model based on the collected data. This article will follow the project outline depicted in the image below.

Author’s picture

In this article, I want to analyze the data collected with Listly and build a simple prediction model on top of it.

Install Listly

First, we need to install the Listly browser extension to start our web scraping process. You can easily install the extension via this link, and you will then find it in your browser. When you click the extension, a page like the image below will be displayed.

Author’s picture

Next, we need to create a Listly account. Fortunately, Listly offers a free account that we can use to scrape data from web pages. You can also try the Business plan if you wish.

Author’s picture

With all the essential preparation ready, we can try collecting data with Listly.

Data gathering

Data collection depends on the data project we want to do. In this article, let’s say I want to analyze VR product data from the Aliexpress e-commerce site; we can configure Listly to collect it.

Let’s see how to start our data extraction. First, as a starting point, I need the URL of the webpage we want to scrape, which is the Aliexpress link here.

GIF by author

By selecting the Listly Part button in the extension, we can control which parts of the page we want to collect. Listly handles the rest and automatically extracts the data from the webpage.

So far, we have been scraping data from a single web page with Listly. However, most e-commerce websites have multiple pages. So, how do we quickly scrape many web pages with Listly? In this case, we have to go through a few steps.

First, after running Listly Part on a single webpage, we need to select the Group button to scrape data from multiple pages.

Author’s picture

On the next page, we need to provide the URLs of all the pages we want to scrape, one per line.

Author’s picture

With Python, we can loop over the link we used previously to quickly generate all the page URLs we want. For example, I want to scrape 20 pages of Aliexpress VR product search results.

with open('readme.txt', 'w') as fi:
    # Write the URL of each of the 20 search result pages, one per line
    for i in range(1, 21):
        fi.write(f"https://www.aliexpress.com/wholesale?trafficChannel=main&d=y&CatId=0&SearchText=VR&ltype=wholesale&SortType=default&page={i}")
        fi.write('\n')

Websites with multiple pages often have a query parameter that specifies the page number. We take advantage of this and create a loop in a notebook to generate all the URLs we need.

Author’s picture

Also, since Listly is based in South Korea, some websites, such as Aliexpress, change region based on location. We can change the proxy setting to ensure we get the correct location; however, this feature is only available on the Enterprise plan.

Author’s picture

After submitting the group scraping job, we get the data in Excel format. Click the Excel Group button and download the data.

Author’s picture

Let’s check the Excel result before doing further analysis.

Author’s picture

Some of the data may be messy because of the web page’s structure, which means we need to clean it up a bit. Let’s start exploring the data we scraped.

Data cleaning

The previously scraped data is stored in this GitHub repository for reproducibility. Let’s start by loading the data and doing some cleaning.

import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)
df = pd.read_excel('aliexpress_vr_20.xlsx')

As we discovered earlier, the data requires some cleaning before we can gain meaningful insight from it. To keep things short, I have already done the data understanding and will only provide the code for the various cleaning jobs.
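Before the cleaning steps, it can help to see which of the scraped LABEL-* columns are actually populated; a minimal sketch against the df loaded above:

# Share of non-null values per scraped column, highest first
label_cols = [c for c in df.columns if c.startswith('LABEL')]
print(df[label_cols].notna().mean().sort_values(ascending=False).head(10))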

Get product price

# LABEL-4 appears to hold the integer part of the price and LABEL-6 the cents
df['LABEL-6'] = df['LABEL-6'].apply(lambda x: x * 0.01)
df['Price'] = df['LABEL-4'] + df['LABEL-6']
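A quick sanity check that the combined price looks right, just printing the first few rows:

# LABEL-6 has already been scaled to a fraction at this point
print(df[['LABEL-4', 'LABEL-6', 'Price']].head())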

Get Product Rating

def check_rating(x):
    # The rating can land in different columns depending on the page layout
    if pd.notna(x['LABEL-9']):
        return x['LABEL-9']
    elif pd.notna(x['LABEL-20']):
        return x['LABEL-20']
    elif pd.notna(x['LABEL-24']):
        return x['LABEL-24']
    elif pd.notna(x['LABEL-29']):
        return x['LABEL-29']
    else:
        return np.nan

df['Rating'] = df.apply(check_rating, axis=1)
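An equivalent, more compact way to take the first non-null value across the candidate columns is pandas’ bfill along the column axis; a sketch:

# Coalesce: first non-null rating among the candidate columns
rating_cols = ['LABEL-9', 'LABEL-20', 'LABEL-24', 'LABEL-29']
df['Rating'] = df[rating_cols].bfill(axis=1).iloc[:, 0]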

Get the number of products sold

def check_sold(x):
    # The "N sold" text can also land in different columns
    if pd.notna(x['LABEL-7']) and 'sold' in x['LABEL-7']:
        return int(x['LABEL-7'].split('sold')[0])
    elif pd.notna(x['LABEL-16']) and 'sold' in x['LABEL-16']:
        return int(x['LABEL-16'].split('sold')[0])
    elif pd.notna(x['LABEL-17']) and 'sold' in x['LABEL-17']:
        return int(x['LABEL-17'].split('sold')[0])
    elif pd.notna(x['LABEL-18']) and 'sold' in x['LABEL-18']:
        return int(x['LABEL-18'].split('sold')[0])
    elif pd.notna(x['LABEL-36']) and 'sold' in x['LABEL-36']:
        return int(x['LABEL-36'].split('sold')[0])
    else:
        return 0

df['Sold_number'] = df.apply(check_sold, axis=1)
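The same extraction can be done in a vectorized way with a regular expression; a sketch, assuming the scraped text looks like "5 sold". Note it takes the first non-null candidate column, so it can differ from the row-wise version above when that column lacks the text:

# Take the first non-null candidate, then pull the digits before 'sold'
sold_cols = ['LABEL-7', 'LABEL-16', 'LABEL-17', 'LABEL-18', 'LABEL-36']
sold_text = df[sold_cols].bfill(axis=1).iloc[:, 0]
df['Sold_number'] = (
    sold_text.str.extract(r'(\d+)\s*sold')[0]
    .astype(float)
    .fillna(0)
    .astype(int)
)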

Get Free Shipping Offer Data

def check_ship(x):
    # The free-shipping badge can land in different columns depending on layout.
    # A row that matches a column without the text implicitly returns None,
    # which the fillna(0) below turns into 0.
    if pd.notna(x['LABEL-10']):
        if 'Free Shipping' in x['LABEL-10']:
            return 1
    elif pd.notna(x['LABEL-21']):
        if 'Free Shipping' in x['LABEL-21']:
            return 1
    elif pd.notna(x['LABEL-25']):
        if 'Free Shipping' in x['LABEL-25']:
            return 1
    elif pd.notna(x['LABEL-30']):
        if 'Free Shipping' in x['LABEL-30']:
            return 1
    elif pd.notna(x['LABEL-32']):
        if 'Free Shipping' in x['LABEL-32']:
            return 1
    else:
        return 0

df['free_shipping_offer'] = df.apply(check_ship, axis=1)
df['free_shipping_offer'] = df['free_shipping_offer'].fillna(0).astype(int)
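A vectorized alternative flags a row if any of the candidate columns mentions the offer; a sketch (note that, unlike the row-wise function, it checks every candidate column):

ship_cols = ['LABEL-10', 'LABEL-21', 'LABEL-25', 'LABEL-30', 'LABEL-32']
df['free_shipping_offer'] = (
    # astype(str) turns NaN into 'nan', which simply never matches
    df[ship_cols].astype(str)
    .apply(lambda col: col.str.contains('Free Shipping'))
    .any(axis=1)
    .astype(int)
)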

Get Free Returns Offer Data

def check_return(x):
    # Same pattern as check_ship, for the 'Free Return' badge
    if pd.notna(x['LABEL-11']):
        if 'Free Return' in x['LABEL-11']:
            return 1
    elif pd.notna(x['LABEL-26']):
        if 'Free Return' in x['LABEL-26']:
            return 1
    elif pd.notna(x['LABEL-31']):
        if 'Free Return' in x['LABEL-31']:
            return 1
    elif pd.notna(x['LABEL-40']):
        if 'Free Return' in x['LABEL-40']:
            return 1
    elif pd.notna(x['LABEL-43']):
        if 'Free Return' in x['LABEL-43']:
            return 1
    elif pd.notna(x['LABEL-44']):
        if 'Free Return' in x['LABEL-44']:
            return 1
    else:
        return 0

df['free_return_offer'] = df.apply(check_return, axis=1)
df['free_return_offer'] = df['free_return_offer'].fillna(0)

You could add more features and clean the data further, but for now we’ll stick with the current features. Let’s do one last cleanup step and prepare our data frame.

df = df.rename(columns={'LABEL-2': 'Product_name', 'LABEL-12': 'Store_name'})
data = df[['Product_name', 'Store_name', 'Rating', 'Price', 'Sold_number', 'free_shipping_offer', 'free_return_offer']].copy()
data.head()
Author’s picture

Data analysis

Let’s explore the dataset we have acquired to understand it better. First, let’s look at the basic information.

data.info()
Author’s picture

As we can see, we have about 1,200 rows of data, but only about half of them contain rating data. This is probably because many products have sold too few units for anyone to have bothered leaving a rating.
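As a quick check, we can quantify the missing share directly; a one-liner sketch:

# Share of rows with no rating at all
print(f"{data['Rating'].isna().mean():.1%} of rows have no rating")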

Let’s take a look at the basic statistics.

data.describe()
Author’s picture

Let’s use visualization to better understand the data.

import seaborn as sns
import matplotlib.pyplot as plt

sns.distplot(data['Rating'])
plt.title('Rating Distribution')
Author’s picture

I use the same distribution code for Price and Sold_number, as sketched below.
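A compact way to repeat the plot is a small loop over the remaining numeric columns, using the same distplot call:

# One distribution plot per remaining numeric column
for col in ['Price', 'Sold_number']:
    plt.figure()
    sns.distplot(data[col])
    plt.title(f'{col} Distribution')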

Author’s picture

Then I would try to visualize the free shipping and free return count plot.

sns.countplot(data['free_return_offer'])
plt.title('Free Return Offer Count Plot')
Author’s picture

From these statistics, we gain a few insights:

  1. Most of the ratings given are good (above 4.5)
  2. 75% of the products are priced below $25
  3. More than 25% of the listed products have no sales at all
  4. Sellers prefer to offer free shipping over free returns
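The quantile-based claims above can be read straight off the data; a quick verification sketch:

# 75th percentile of price, and the share of products with zero sales
print('75% of prices are below:', data['Price'].quantile(0.75))
print('Share with no sales:', (data['Sold_number'] == 0).mean())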

Build a simple regression model

After exploring the data, I want to create a simple regression model to predict how many VR products will be sold.

Of course, when building real models, we would need to explore the data further and have a more rigorous hypothesis. For now, we are only trying to create a simple model from the data we scraped with Listly.

First, let’s check the correlation between the numerical characteristics.

sns.heatmap(data.corr(), annot = True)
Author’s picture

It seems the free return offer variable correlates most strongly with the number sold. Now let’s split the data into training and test sets. Note that I won’t use the rating data as an independent variable.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(
    data[['Price', 'free_shipping_offer', 'free_return_offer']],
    data['Sold_number'],
    shuffle=True,
    train_size=0.3,
    random_state=42,
)
model = LinearRegression()
model.fit(X_train, y_train)

In a few lines, we managed to build the regression model. Let’s see how the model performs.

from sklearn.metrics import mean_squared_error, r2_score
predictions = model.predict(X_test)
r2 = r2_score(y_test, predictions)
rmse = mean_squared_error(y_test, predictions, squared=False)
print('The r2 is: ', r2)
print('The rmse is: ', rmse)
Author’s picture

As we can see, the evaluation shows that the model is not very good. We could improve it with feature engineering and additional data cleaning, but even leaving it at that, our goal of building a model based on Listly-scraped data has been achieved.
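If you did want to push further, one cheap experiment (a sketch, not part of the workflow above) is swapping in a tree-based model on the same features:

from sklearn.ensemble import RandomForestRegressor

# Same split, same features; only the estimator changes
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)
print('RF r2:', r2_score(y_test, rf.predict(X_test)))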

Data Collection Scheduler

When building a model or monitoring data, we know there is always new data to collect over time. Fortunately, Listly offers a scheduler for recurring scraping.

To find the scheduler, go to your Listly data table and look for the Schedule button on your scraping job.

Author’s picture

After clicking the Schedule button, you can set how often you want the page to be scraped: daily, weekly, or monthly. You can also set the time of day at which the scrape should run.

GIF by author

Additional ideas for using Listly

With how easy Listly makes scraping data, you can do various things as a data person, such as:

  1. Build a monitoring dashboard
  2. Social Media Portfolio Optimization
  3. Keyword research
  4. Search engine development

And many more ideas you might think of; the sky is the limit. Since Listly makes collecting data easy, you should try out the different things you can think of.
