Create Your Own RSSHub Route
As mentioned earlier, we will create an RSS feed for GitHub Repo Issues as an example. We will show all four data collection methods mentioned:
# Via API
# Check the API documentation
Different sites have different APIs. You can check the API documentation for the site you want to create an RSS feed for. In this case, we will use the GitHub Issues API.
# Create the main file
Open your code editor and create a new file. Since we are going to create an RSS feed for GitHub issues, it is suggested that you save the file as issue.js, but you can name it whatever you like.
Here's the basic code to get you started:
# Retrieving user input
As mentioned earlier, we need to retrieve the GitHub username and repository name from user input. The repository name should default to RSSHub if it is not provided in the request URL. Here's how you can do it:
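A minimal sketch of the two forms follows, assuming ctx.params carries the route parameters user and repo. The function names are hypothetical, and the ctx objects at the bottom are hand-built stand-ins with placeholder values, not real RSSHub request contexts:

```javascript
// Form 1: object destructuring with a default value for `repo`
function readParamsWithDestructuring(ctx) {
    const { user, repo = 'RSSHub' } = ctx.params;
    return { user, repo };
}

// Form 2: plain assignment with the nullish coalescing operator
function readParamsWithNullish(ctx) {
    const user = ctx.params.user;
    const repo = ctx.params.repo ?? 'RSSHub';
    return { user, repo };
}

// Hypothetical contexts, built by hand for illustration only
const withRepo = { params: { user: 'some-user', repo: 'some-repo' } };
const withoutRepo = { params: { user: 'some-user' } };

console.log(readParamsWithDestructuring(withRepo));
console.log(readParamsWithNullish(withoutRepo));
```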
Both of these code snippets do the same thing. The first one uses object destructuring to assign the user and repo variables, while the second one uses traditional assignment with a nullish coalescing operator to assign the repo variable a default value of RSSHub if it is not provided in the request URL.
# Getting data from the API
After we have the user input, we can use it to make a request to the API. In most cases, you will need to use got from @/utils/got (a customized got wrapper) to make HTTP requests. For more information, please refer to the got documentation.
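As a rough sketch of this step, the request could look like the code below. The endpoint https://api.github.com/repos/{user}/{repo}/issues, the per_page query parameter, and the application/vnd.github.html+json media type come from the GitHub REST API. The helper names buildIssuesRequest and fetchIssues, the default limit of 30, and passing the HTTP client in as an argument are assumptions made so the snippet runs on its own. In a real route you would pass in the got imported from @/utils/got, which is assumed here to resolve to a response whose data property holds the parsed body.

```javascript
// Build the URL and request options for the GitHub Issues API.
function buildIssuesRequest(user, repo, limit = 30) {
    return {
        url: `https://api.github.com/repos/${user}/${repo}/issues`,
        options: {
            // Ask GitHub to include the rendered HTML body of each issue
            headers: { accept: 'application/vnd.github.html+json' },
            searchParams: { per_page: limit },
        },
    };
}

// `httpClient` is expected to behave like got: it is called as
// httpClient(url, options) and resolves to an object with a `data` property.
async function fetchIssues(httpClient, user, repo, limit) {
    const { url, options } = buildIssuesRequest(user, repo, limit);
    const { data } = await httpClient(url, options);
    return data;
}

// Stand-in client for illustration; it returns canned data instead of calling GitHub.
const fakeClient = async (url) => ({ data: [{ title: `stub for ${url}` }] });

fetchIssues(fakeClient, 'some-user', 'RSSHub').then((issues) => console.log(issues.length));
```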
# Outputting the RSS
Once we have retrieved the data from the API, we need to process it further to generate an RSS feed that conforms to the RSS specification. Specifically, we need to extract the channel title, channel link, item title, item link, item description, and item publication date.
To do this, we can assign the relevant data to the ctx.state.data object, and RSSHub's middleware will take care of the rest.
Here is the final code that you should have:
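The sketch below shows one way the mapping could look, assuming issues is the array returned by the GitHub Issues API. The field names title, html_url, body_html, body, created_at, user.login, and labels[].name come from GitHub's issue objects. The function name buildFeed and the sample data are hypothetical, and new Date() stands in for RSSHub's parseDate utility so the snippet runs on its own. In a route, the returned object would be assigned to ctx.state.data.

```javascript
// Map GitHub issue objects to the shape RSSHub expects in ctx.state.data.
function buildFeed(user, repo, issues) {
    return {
        // channel title
        title: `${user}/${repo} issues`,
        // channel link
        link: `https://github.com/${user}/${repo}/issues`,
        // each feed item
        item: issues.map((issue) => ({
            title: issue.title,
            link: issue.html_url,
            // body_html is only present when the HTML media type was requested
            description: issue.body_html ?? issue.body,
            pubDate: new Date(issue.created_at), // stand-in for parseDate()
            author: issue.user.login,
            category: issue.labels.map((label) => label.name),
        })),
    };
}

// Hypothetical sample data for illustration only
const sampleIssues = [
    {
        title: 'Example issue',
        html_url: 'https://github.com/some-user/RSSHub/issues/1',
        body_html: '<p>Example body</p>',
        created_at: '2023-01-01T00:00:00Z',
        user: { login: 'some-user' },
        labels: [{ name: 'bug' }],
    },
];

const feed = buildFeed('some-user', 'RSSHub', sampleIssues);
console.log(feed.title, feed.item.length);
```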
# Via HTML web page using got
# Create the main file
To start, open your code editor and create a new file. Since we are going to create an RSS feed for GitHub issues, it is suggested that you save the file as issue.js. However, you can also name it whatever you like.
Here's the basic code to get you started:
// Require necessary modules
const got = require('@/utils/got'); // a customised got
const cheerio = require('cheerio'); // an HTML parser with a jQuery-like API
const { parseDate } = require('@/utils/parse-date');

module.exports = async (ctx) => {
    // Your logic here

    ctx.state.data = {
        // Your RSS output here
    };
};
The parseDate function is a utility function provided by RSSHub that we will use to parse dates later in the code.
You will add your own code to extract data from the HTML document, process it, and output it in RSS format. We will cover the details of this process in the next steps.
# Retrieving user input
As mentioned before, we want users to enter a GitHub username and a repository name, and fall back to RSSHub if they don't enter the repository name in the request URL.
module.exports = async (ctx) => {
    // Retrieve user and repository name from the URL parameters
    const { user, repo = 'RSSHub' } = ctx.params;

    ctx.state.data = {
        // Your RSS output here
    };
};
In this code, user will be set to the value of the user parameter, and repo will be set to the value of the repo parameter if it exists, and to RSSHub otherwise.
# Getting data from the web page
After receiving the user input, we need to make a request to the web page to retrieve the information we need. In most cases, we'll use got from @/utils/got (a customized got wrapper) to make HTTP requests. You can find more information on how to use got in the got documentation.
To begin, we'll make an HTTP GET request to the web page and load the HTML response into Cheerio, a library that helps us parse and manipulate HTML.
const baseUrl = 'https://github.com';
const { user, repo = 'RSSHub' } = ctx.params;
// Note that the ".data" property contains the full HTML source of the target page returned by the request
const { data: response } = await got(`${baseUrl}/${user}/${repo}/issues`);
const $ = cheerio.load(response);
Next, we'll use Cheerio selectors to select the relevant HTML elements, parse the data we need, and convert it into an array.
// We use a Cheerio selector to select all elements with the class name 'flex-auto'
// that are descendants of a 'div' element with the class name 'js-navigation-container'.
const items = $('div.js-navigation-container .flex-auto')
    // We use the `toArray()` method to retrieve all the DOM elements selected as an array.
    .toArray()
    // We use the `map()` method to traverse the array and parse the data we need from each element.
    .map((item) => {
        item = $(item);
        const a = item.find('a').first();
        return {
            title: a.text(),
            // We need an absolute URL for `link`, but `a.attr('href')` returns a relative URL.
            link: `${baseUrl}${a.attr('href')}`,
            pubDate: parseDate(item.find('relative-time').attr('datetime')),
            author: item.find('.opened-by a').text(),
            category: item
                .find('a[id^=label]')
                .toArray()
                .map((item) => $(item).text()),
        };
    });

ctx.state.data = {
    // Your RSS output here
};
# Outputting the RSS
Once we have the data from the web page, we need to further process it to generate RSS in accordance with the RSS specification. Mainly, we need the channel title, channel link, item title, item link, item description, and item publication date.
Assign them to the ctx.state.data object, and RSSHub's middleware will take care of the rest.
Here's an example:
const got = require('@/utils/got');
const cheerio = require('cheerio');
const { parseDate } = require('@/utils/parse-date');

module.exports = async (ctx) => {
    const baseUrl = 'https://github.com';
    const { user, repo = 'RSSHub' } = ctx.params;

    const { data: response } = await got(`${baseUrl}/${user}/${repo}/issues`);
    const $ = cheerio.load(response);

    // Named `items` so that it matches the `item: items` assignment below
    const items = $('div.js-navigation-container .flex-auto')
        .toArray()
        .map((item) => {
            item = $(item);
            const a = item.find('a').first();
            return {
                title: a.text(),
                link: `${baseUrl}${a.attr('href')}`,
                pubDate: parseDate(item.find('relative-time').attr('datetime')),
                author: item.find('.opened-by a').text(),
                category: item
                    .find('a[id^=label]')
                    .toArray()
                    .map((item) => $(item).text()),
            };
        });

    ctx.state.data = {
        // channel title
        title: `${user}/${repo} issues`,
        // channel link
        link: `${baseUrl}/${user}/${repo}/issues`,
        // each feed item
        item: items,
    };
};
# Better Reading Experience
The previous code provides only part of the information for each feed item. To provide a better reading experience, we can add the full article to each feed item, in this case the issue body.
Here's the updated code:
const got = require('@/utils/got');
const cheerio = require('cheerio');
const { parseDate } = require('@/utils/parse-date');

module.exports = async (ctx) => {
    const baseUrl = 'https://github.com';
    const { user, repo = 'RSSHub' } = ctx.params;

    const { data: response } = await got(`${baseUrl}/${user}/${repo}/issues`);
    const $ = cheerio.load(response);

    const list = $('div.js-navigation-container .flex-auto')
        .toArray()
        .map((item) => {
            item = $(item);
            const a = item.find('a').first();
            return {
                title: a.text(),
                link: `${baseUrl}${a.attr('href')}`,
                pubDate: parseDate(item.find('relative-time').attr('datetime')),
                author: item.find('.opened-by a').text(),
                category: item
                    .find('a[id^=label]')
                    .toArray()
                    .map((item) => $(item).text()),
            };
        });

    const items = await Promise.all(
        list.map((item) =>
            ctx.cache.tryGet(item.link, async () => {
                const { data: response } = await got(item.link);
                const $ = cheerio.load(response);

                // Select the first element with the class name 'comment-body'
                item.description = $('.comment-body').first().html();

                // Every property of a list item defined above is reused here
                // and we add a new property 'description'
                return item;
            })
        )
    );

    ctx.state.data = {
        title: `${user}/${repo} issues`,
        link: `https://github.com/${user}/${repo}/issues`,
        item: items,
    };
};
Now the RSS feed will have a similar reading experience to the original website.
Note
Note that in the previous section, we only needed to send one HTTP request using an API to get all the data we needed. However, in this section, we need to send 1 + n HTTP requests, where n is the number of feed items in the list from the first request.
Some websites may not want to receive too many requests in a short amount of time, which can cause them to return an error message like 429 Too Many Requests.
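One way to soften such bursts is to fetch the detail pages in small batches instead of all at once. The helper below is a sketch under assumptions: mapInBatches and the batch size of 2 are hypothetical and not part of RSSHub, and the fakeFetch worker stands in for the ctx.cache.tryGet(...) call shown above.

```javascript
// Run `worker` over `list`, at most `batchSize` calls at a time.
async function mapInBatches(list, batchSize, worker) {
    const results = [];
    for (let i = 0; i < list.length; i += batchSize) {
        const batch = list.slice(i, i + batchSize);
        // Calls inside one batch run concurrently; batches run one after another.
        results.push(...(await Promise.all(batch.map(worker))));
    }
    return results;
}

// Stand-in worker that records how many calls are in flight at once.
let inFlight = 0;
let maxInFlight = 0;
const fakeFetch = async (n) => {
    inFlight += 1;
    maxInFlight = Math.max(maxInFlight, inFlight);
    await new Promise((resolve) => setTimeout(resolve, 5));
    inFlight -= 1;
    return n * 2;
};

const batchDemo = mapInBatches([1, 2, 3, 4, 5], 2, fakeFetch);
batchDemo.then((results) => console.log(results, maxInFlight));
```

Results keep the order of the input list, so the returned array can be used as the item list directly.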
# Using the common configured route
# Create the main file
First, we need a few pieces of data:
- The RSS source link
- The data source link
- The RSS feed title (not the title of individual items)
Open your code editor and create a new file. Since we're going to create an RSS feed for GitHub issues, it's suggested that you save the file as issue.js, but you can name it whatever you like.
Here's some basic code to get you started:
// Import necessary modules
const buildData = require('@/utils/common-config');

module.exports = async (ctx) => {
    ctx.state.data = await buildData({
        link: '', // The RSS source link
        url: '', // The data source link
        // Placeholders can be used here: %xxx% is replaced with the value
        // of the entry with the same name under **params**
        title: '%title%',
        params: {
            title: '', // Additional title
        },
    });
};
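To make the %xxx% placeholders concrete, the sketch below mimics the substitution on plain strings. It is an illustration only, not the code of @/utils/common-config: the function name fillPlaceholders is hypothetical, the sample values are placeholders, and leaving unknown placeholders untouched is an assumption.

```javascript
// Replace every %name% in `template` with params[name], if that entry exists.
function fillPlaceholders(template, params) {
    return template.replace(/%(\w+)%/g, (match, name) =>
        Object.prototype.hasOwnProperty.call(params, name) ? String(params[name]) : match
    );
}

// Hypothetical values for illustration only
const params = {
    title: 'some-user/RSSHub issues',
    baseUrl: 'https://github.com',
};

console.log(fillPlaceholders('%title%', params));
console.log(fillPlaceholders("'%baseUrl%' + $('a').first().attr('href')", params));
```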
Our RSS feed currently lacks content. The item property must be set to add content. Here's an example:
const buildData = require('@/utils/common-config');

module.exports = async (ctx) => {
    const { user, repo = 'RSSHub' } = ctx.params;
    const link = `https://github.com/${user}/${repo}/issues`;

    ctx.state.data = await buildData({
        link,
        url: link,
        title: `${user}/${repo} issues`, // you can also use $('head title').text()
        params: {
            title: `${user}/${repo} issues`,
            baseUrl: 'https://github.com',
        },
        item: {
            item: 'div.js-navigation-container .flex-auto',
            // You need to use template literals if you want to use variables
            // Only JS statements like $().xxx() are supported
            title: `$('a').first().text() + ' - %title%'`, // .text() gets the text of the element
            link: `'%baseUrl%' + $('a').first().attr('href')`, // .attr('href') gets the value of the href attribute
            // description: ..., we don't have a description for now
            pubDate: `parseDate($('relative-time').attr('datetime'))`,
        },
    });
};
You'll notice that the code is similar to the "Getting data from the web page" section above. However, this RSS feed doesn't contain the full article of the issue.
# Retrieving full articles
To get the full article of each issue, you need to add a few more lines of code. Here is an example:
const buildData = require('@/utils/common-config');
const got = require('@/utils/got');
const cheerio = require('cheerio');

module.exports = async (ctx) => {
    const { user, repo = 'RSSHub' } = ctx.params;
    const link = `https://github.com/${user}/${repo}/issues`;

    ctx.state.data = await buildData({
        link,
        url: link,
        title: `${user}/${repo} issues`,
        params: {
            title: `${user}/${repo} issues`,
            baseUrl: 'https://github.com',
        },
        item: {
            item: 'div.js-navigation-container .flex-auto',
            title: `$('a').first().text() + ' - %title%'`,
            link: `'%baseUrl%' + $('a').first().attr('href')`,
            pubDate: `parseDate($('relative-time').attr('datetime'))`,
        },
    });

    // Keep the values returned by tryGet, so that items served from the cache
    // (for which the callback does not run) also carry their description
    ctx.state.data.item = await Promise.all(
        ctx.state.data.item.map((item) =>
            ctx.cache.tryGet(item.link, async () => {
                const { data: response } = await got(item.link);
                const $ = cheerio.load(response);
                item.description = $('.comment-body').first().html();
                return item;
            })
        )
    );
};
The code above closely resembles the "Via HTML web page using got" section, which also retrieves full articles with a few extra lines of code. It is recommended that you use the method from that section whenever possible, as it is more flexible than using @/utils/common-config.
# Using puppeteer
Using puppeteer is another approach to obtain data from websites. However, it is recommended that you try the above methods first. It is also recommended that you read the "Via HTML web page using got" section first, since this section extends it and does not explain the basic concepts again.
# Create the main file
To get started with puppeteer, create a new file in your code editor and save it with an appropriate name, such as issue.js. Then, require the necessary modules and set up the basic structure of the function:
// Require some useful modules
const cheerio = require('cheerio'); // an HTML parser with a jQuery-like API
const { parseDate } = require('@/utils/parse-date');
const logger = require('@/utils/logger');

module.exports = async (ctx) => {
    // Your logic here

    ctx.state.data = {
        // Your RSS output here
    };
};
# Replace got with puppeteer
Now, we will be using puppeteer instead of got to retrieve data from the web page.
# Retrieving full articles
Retrieving the full articles of each issue using a new browser page is similar to the previous section. We can use the following code:
const cheerio = require('cheerio');
const { parseDate } = require('@/utils/parse-date');
const logger = require('@/utils/logger');

module.exports = async (ctx) => {
    const baseUrl = 'https://github.com';
    const { user, repo = 'RSSHub' } = ctx.params;

    const browser = await require('@/utils/puppeteer')();
    const page = await browser.newPage();
    await page.setRequestInterception(true);
    page.on('request', (request) => {
        request.resourceType() === 'document' ? request.continue() : request.abort();
    });

    const link = `${baseUrl}/${user}/${repo}/issues`;
    logger.debug(`Requesting ${link}`);
    await page.goto(link, {
        waitUntil: 'domcontentloaded',
    });
    const response = await page.content();
    page.close();

    const $ = cheerio.load(response);

    const list = $('div.js-navigation-container .flex-auto')
        .toArray()
        .map((item) => {
            item = $(item);
            const a = item.find('a').first();
            return {
                title: a.text(),
                link: `${baseUrl}${a.attr('href')}`,
                pubDate: parseDate(item.find('relative-time').attr('datetime')),
                author: item.find('.opened-by a').text(),
                category: item
                    .find('a[id^=label]')
                    .toArray()
                    .map((item) => $(item).text()),
            };
        });

    const items = await Promise.all(
        list.map((item) =>
            ctx.cache.tryGet(item.link, async () => {
                // reuse the browser instance and open a new tab
                const page = await browser.newPage();
                // set up request interception to only allow document requests
                await page.setRequestInterception(true);
                page.on('request', (request) => {
                    request.resourceType() === 'document' ? request.continue() : request.abort();
                });

                logger.debug(`Requesting ${item.link}`);
                await page.goto(item.link, {
                    waitUntil: 'domcontentloaded',
                });
                const response = await page.content();
                // close the tab after retrieving the HTML content
                page.close();

                const $ = cheerio.load(response);

                item.description = $('.comment-body').first().html();

                return item;
            })
        )
    );

    // close the browser instance after all requests are done
    browser.close();

    ctx.state.data = {
        title: `${user}/${repo} issues`,
        link: `https://github.com/${user}/${repo}/issues`,
        item: items,
    };
};
# Additional Resources
Here are some resources you can use to learn more about puppeteer:
# Intercepting requests
When scraping web pages, you may encounter images, fonts, and other resources that you don't need. These resources can slow down the page load time and use up valuable CPU and memory resources. To avoid this, you can enable request interception in puppeteer.
Here's how to do it:
await page.setRequestInterception(true);
page.on('request', (request) => {
    request.resourceType() === 'document' ? request.continue() : request.abort();
});
// These two statements must be placed before page.goto()
You can find all the possible values of request.resourceType() in the Puppeteer documentation. When using these values in your code, make sure to use lowercase letters.
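If a page needs more than the HTML document, for example scripts that render the content, the same handler can be generalized to an allow-list. This is a sketch: createRequestFilter is a hypothetical helper, and the fakeRequest objects only imitate the three Puppeteer request methods used here (resourceType(), continue(), and abort()).

```javascript
// Build a 'request' event handler that lets only the listed resource types through.
function createRequestFilter(allowedTypes) {
    const allowed = new Set(allowedTypes);
    return (request) => {
        if (allowed.has(request.resourceType())) {
            request.continue();
        } else {
            request.abort();
        }
    };
}

// In a route, this would be registered before page.goto():
//   await page.setRequestInterception(true);
//   page.on('request', createRequestFilter(['document', 'script']));

// Stand-in request object for illustration; it records what the handler decided.
function fakeRequest(type) {
    return {
        outcome: null,
        resourceType() {
            return type;
        },
        continue() {
            this.outcome = 'continued';
        },
        abort() {
            this.outcome = 'aborted';
        },
    };
}

const filter = createRequestFilter(['document', 'script']);
const scriptRequest = fakeRequest('script');
const imageRequest = fakeRequest('image');
filter(scriptRequest);
filter(imageRequest);
console.log(scriptRequest.outcome, imageRequest.outcome);
```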
# Wait Until
In the code above, you'll see that waitUntil: 'domcontentloaded' is used in the page.goto() function. This is a Puppeteer option that tells it when to consider a navigation successful. You can find all the possible values and their meanings in the Puppeteer documentation.
It's worth noting that domcontentloaded waits for a shorter time than the default value load, and networkidle0 may not be suitable for websites that keep sending background telemetry or fetching data.
Additionally, it's important to avoid waiting for a specific timeout and instead wait for a selector to appear. Waiting for a timeout is inaccurate, as it depends on the load of the Puppeteer instance.
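A sketch of that advice, assuming page is a Puppeteer page. page.goto(), page.waitForSelector(), and page.content() are Puppeteer methods; the helper name loadWhenReady, the 10-second timeout, the placeholder URL, and the recording fakePage are assumptions for illustration.

```javascript
// Navigate, then wait until `selector` is present before reading the HTML.
async function loadWhenReady(page, url, selector) {
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    // Resolves as soon as the selector matches; rejects if it does not
    // appear within the timeout (in milliseconds).
    await page.waitForSelector(selector, { timeout: 10000 });
    return page.content();
}

// Stand-in page for illustration; it records the calls instead of driving a browser.
const calls = [];
const fakePage = {
    async goto(url, options) {
        calls.push(['goto', url, options.waitUntil]);
    },
    async waitForSelector(selector, options) {
        calls.push(['waitForSelector', selector, options.timeout]);
    },
    async content() {
        return '<html></html>';
    },
};

const loaded = loadWhenReady(fakePage, 'https://github.com/some-user/RSSHub/issues', '.js-navigation-container');
loaded.then((html) => console.log(html, calls.map((call) => call[0])));
```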