luke.b//blog

the journal scraper

~ #journal

The next phase of journal will require a mechanism to take events from the Matrix network and write them to blog files in the journal web articles directory.

Until now, I had assumed that a Matrix “bot” user would be sufficient for pulling blog content in this way but, in hindsight, a Matrix Application Service would be a better fit.

Matrix App Services

If you’re unfamiliar with App Services, or “ASes”, they are one way of writing a third-party integration with Matrix. Their limitation is that the Matrix home server (HS) must be aware of the AS at runtime in order to handle requests that pertain to it.

Requests are usually filtered by Matrix user ID and Matrix room alias. A typical implementation inserts a prefix into user IDs and room aliases to indicate that they belong to an AS.

For example:

@_journal_bot_12345:homeserver.bla
#_journal_blog_lukesblog:homeserver.bla

These prefixes are indicated to a home server via a registration file, which is generated before the HS starts. They are used when certain requests are made:

  • inviting a user to a room
  • querying for third party channel IDs
  • etc.
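To make the prefix idea concrete, here is a rough sketch of what a journal registration might contain and how an ID could be checked against it. The field names follow the Matrix Application Service registration format, but the registration itself would normally be a YAML file; the dict, every value in it and the `owned_by_as` helper are assumptions for illustration. How a real home server applies the regexes is up to the HS.

```python
import re

# Hypothetical registration for a journal AS, written as a Python dict.
# Field names follow the Matrix AS registration format; all values are
# placeholders made up for this sketch.
REGISTRATION = {
    "id": "journal",
    "url": "http://localhost:9000",  # where the HS reaches the AS (assumed)
    "as_token": "<secret the AS sends to the HS>",
    "hs_token": "<secret the HS sends to the AS>",
    "sender_localpart": "_journal_bot",
    "namespaces": {
        "users": [{"exclusive": True, "regex": r"@_journal_.*:homeserver\.bla"}],
        "aliases": [{"exclusive": True, "regex": r"#_journal_.*:homeserver\.bla"}],
        "rooms": [],
    },
}


def owned_by_as(identifier: str, kind: str) -> bool:
    """Return True if a user ID or room alias falls inside an AS namespace.

    `kind` is "users" or "aliases". Full-string matching is a choice made
    for this sketch, not a statement about any particular home server.
    """
    entries = REGISTRATION["namespaces"].get(kind, [])
    return any(re.fullmatch(entry["regex"], identifier) for entry in entries)
```

With this, the two example IDs above fall inside the namespace, while an ordinary user such as `@luke:homeserver.bla` does not.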

The AS exposes an API to the HS that is used when the related users/rooms are queried.
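As a rough sketch of that API, the function below answers the two query paths from the Application Service spec (`/_matrix/app/v1/users/{userId}` and `/_matrix/app/v1/rooms/{roomAlias}`). The function name, the hard-coded prefixes and the choice of error codes are my assumptions; a real AS would also check the `hs_token` and sit behind an HTTP server.

```python
from urllib.parse import unquote

# Assumed prefixes, taken from the example IDs earlier in this post.
USER_PREFIX = "@_journal_"
ALIAS_PREFIX = "#_journal_"

ROUTES = (
    ("/_matrix/app/v1/users/", USER_PREFIX),
    ("/_matrix/app/v1/rooms/", ALIAS_PREFIX),
)


def handle_query(path: str) -> tuple[int, dict]:
    """Answer a home server query about a user ID or room alias.

    Returns an (HTTP status, JSON body) pair: 200 with an empty object if
    the identifier belongs to this AS, otherwise 404 with an error code.
    """
    for route, prefix in ROUTES:
        if path.startswith(route):
            # The identifier arrives percent-encoded in the URL path.
            identifier = unquote(path[len(route):])
            if identifier.startswith(prefix):
                return 200, {}
            return 404, {"errcode": "M_NOT_FOUND"}
    return 404, {"errcode": "M_UNRECOGNIZED"}
```

In a real service the AS would be expected to create the user or room before answering 200; that step is left out here.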

journal AS

The journal AS “bot” user (which serves any purpose as defined by the AS) would scrape messages from any room that it is invited to. The blog content of each scraped message would be written to a file in the journal article directory.

This has the advantage of not requiring a room alias; aliases are better suited to bridging message channels in third-party namespaces.

The drawback of this is that it requires a journal AS to be registered with a Matrix HS. For full decentralisation, this is required per blog, which could be a lot of effort. Hopefully things can be made easier through dedicated Dockerfiles and plenty of documentation on how to get things running.
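The writing step might look something like the sketch below. It takes a Matrix `m.room.message` event as a dict and writes its `content.body` to a file. The `write_article` name, the `.txt` extension and the naming of files after the event ID are placeholders of mine; journal's real article layout is not decided yet.

```python
import re
from pathlib import Path
from typing import Optional


def write_article(event: dict, articles_dir: Path) -> Optional[Path]:
    """Write the body of a message event to a file and return its path.

    Events that are not messages, or that carry no text body, are skipped
    and None is returned.
    """
    if event.get("type") != "m.room.message":
        return None
    body = event.get("content", {}).get("body")
    if not isinstance(body, str):
        return None
    # Event IDs contain characters such as "$" and ":" that are awkward in
    # file names, so everything outside a conservative set becomes "_".
    safe_id = re.sub(r"[^A-Za-z0-9_-]", "_", event["event_id"])
    articles_dir.mkdir(parents=True, exist_ok=True)
    path = articles_dir / f"{safe_id}.txt"
    path.write_text(body, encoding="utf-8")
    return path
```

Naming files after the event ID keeps the write idempotent: if the same event is delivered twice, the file is simply overwritten.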

journal “delegated bot”

An alternative is getting the journal admin to create a bot account by following the registration process against a home server (through a web UI) and copying the access token to the configuration file for the journal scraper. The password could be randomized and discarded to avoid being used to actually log in.

This is mostly security by throwing away the keys, but it does provide a mechanism for creating a bot user against an arbitrary home server, even if the journal admin doesn’t own the HS itself.
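A minimal sketch of the delegated bot, assuming it polls the client-server `/sync` endpoint with the copied access token. The function names are hypothetical and nothing here touches the network: one helper builds the request, another pulls message bodies out of a sync response that has already been decoded from JSON.

```python
import secrets
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request


def throwaway_password() -> str:
    """Random password to register the bot with and then discard."""
    return secrets.token_urlsafe(32)


def build_sync_request(
    homeserver: str, access_token: str, since: Optional[str] = None
) -> Request:
    """Build a GET request for /sync, authenticated with the bot's token."""
    params = {"timeout": "30000"}  # long-poll for up to 30 seconds
    if since is not None:
        params["since"] = since  # resume from the previous next_batch token
    url = f"{homeserver.rstrip('/')}/_matrix/client/v3/sync?{urlencode(params)}"
    return Request(url, headers={"Authorization": f"Bearer {access_token}"})


def message_bodies(sync_response: dict) -> list[str]:
    """Collect text bodies of messages from every joined room in a sync."""
    bodies = []
    joined = sync_response.get("rooms", {}).get("join", {})
    for room in joined.values():
        for event in room.get("timeline", {}).get("events", []):
            if event.get("type") == "m.room.message":
                body = event.get("content", {}).get("body")
                if isinstance(body, str):
                    bodies.append(body)
    return bodies
```

The request can be passed to `urllib.request.urlopen`, and the `next_batch` value from each response fed back in as `since`. Note that this only reads rooms the bot has already joined; accepting invites would need handling separately.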

Fin

As you can tell, things are very much hypothetical at this stage but there’s probably enough for me to get started on a POC blog scraper.

Thanks for reading. Don’t forget to watch & star the GitHub project.