Full-text search your own Mastodon posts with R



Whether you've fully migrated from Twitter to Mastodon, are just trying out the "fediverse," or have long been a Mastodon user, you may miss being able to search through the full text of "toots" (also known as posts). In Mastodon, hashtags are searchable but other, non-hashtag text is not. The unavailability of full-text search lets users control how much of their content is easily discoverable by strangers. But what if you want to be able to search your own posts?

Some Mastodon instances allow users to do full-text searches of their own toots but others don't, depending on the admin. Fortunately, it's easy to full-text search your own Mastodon posts, thanks to R and the rtoot package developed by David Schoch. That's what this article is about.

Set up a full-text search

First, install the rtoot package if it isn't already on your system with install.packages("rtoot"). I'll also be using the dplyr and DT packages. All three can be loaded with the following:


# install.packages("rtoot") # if needed
library(rtoot)
library(dplyr)
library(DT)

Next, you'll need your Mastodon ID, which isn't the same as your user name and instance. The rtoot package includes a way to search across the fediverse for accounts. That's a useful tool if you want to see whether someone has an account anywhere on Mastodon. But since it also returns account IDs, you can use it to find your own ID, too.

To search for my own ID, I'd use:


accounts <- search_accounts("smach@fosstodon.org")

That will likely bring back a data frame with just one result. If you search for only a user name without an instance, such as search_accounts("posit") to see if Posit (formerly RStudio) is active on Mastodon, there could be more results. 

My search had just one result, so my ID is the first (and only) item in the id column:


my_id <- accounts$id[1]

I can now retrieve my posts with rtoot's get_account_statuses() function.

Pull and save your data

The default returns 20 results, at least for now, although the limit appears to be a lot higher if you set it manually with the limit argument. Do be kind about taking advantage of this setting, however, since most Mastodon instances are run by volunteers facing vastly increased hosting costs recently.

The first time you try to pull your own data, you'll be asked to authenticate. I ran the following to get my most recent 50 posts (note the use of verbose = TRUE to see any messages that might be returned):


smach_statuses <- get_account_statuses(my_id, limit = 50, verbose = TRUE)

Next, I was asked if I wanted to authenticate. After choosing yes, I received the following query:


On which instance do you want to authenticate (e.g., "mastodon.social")? 

Then I was asked:


What type of token do you want? 

1: public
2: user

Since I want the authority to see all activity in my own account, I chose user. The package then saved an authentication token for me, and I could then run get_account_statuses().

The resulting data frame (actually a tibble, the special type of data frame used by tidyverse packages) includes 29 columns. A few are list-columns, such as account and media_attachments, with non-atomic results, meaning the results are not in a strict two-dimensional format. 

I suggest saving this result before going further, so you don't need to re-ping the server in case something goes awry with your R session or code. I usually use saveRDS, like so:


saveRDS(smach_statuses, "smach_statuses.Rds")
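To pick up where you left off in a later session, you can read the file back in with readRDS() instead of re-querying the server (this assumes the same file name used above):

```r
# Reload the saved posts in a later session, no server call needed
smach_statuses <- readRDS("smach_statuses.Rds")
```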

Trying to save the results as a parquet file doesn't work because of the complex list columns. Using the vroom package to save as a CSV file works and includes the full text of the list columns. However, I'd rather save as a native .Rds or .Rdata file.

Create a searchable table with your results

If all you want is a searchable table for full-text searching, you only need a few of those 29 columns. You'll definitely want created_at, url, spoiler_text (if you use content warnings and want those in your table), and content. If you miss seeing engagement metrics on your posts, add reblogs_count, favourites_count, and replies_count.

Below is the code I use to create data for a searchable table for my own viewing. I added a url column to create a clickable >> link with the URL of the post, which I then append to the end of each post's content. That makes it easy to click through to the original version:


tabledata <- smach_statuses |>
  filter(content != "") |>
  # filter(visibility == "public") |> # If you want to make this public somewhere. The default includes direct messages.
  mutate(
    url = paste0("<a target=\"_blank\" href=\"", uri, "\"><strong> >></strong></a>"),
    content = paste(content, url),
    created_at = as.character(as.POSIXct(created_at, format = "%Y-%m-%d %H:%M UTC"))
  ) |>
  select(CreatedAt = created_at, Post = content, Replies = replies_count, Favorites = favourites_count, Boosts = reblogs_count)

If I were sharing this table publicly, I'd make sure to uncomment filter(visibility == "public") so only my public posts were available. The data returned by get_account_statuses() for your own account includes posts that are unlisted (available to anyone who finds them, but not on public timelines by default) as well as those that are set to followers only or direct messages.
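A quick way to double-check what you would be exposing is to tally posts by visibility level. This sketch assumes the visibility column returned by get_account_statuses(), the same column used in the commented-out filter:

```r
# Tally posts by visibility: public, unlisted, private (followers-only), direct
table(smach_statuses$visibility)
```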

There are many ways to turn this data into a searchable table. One way is with the DT package. The code below creates an interactive HTML table with search filter boxes that can use regular expressions. (See Do more with R: Quick interactive HTML tables to learn more about using DT.) 


DT::datatable(tabledata, filter = "top", escape = FALSE, rownames = FALSE, 
  options = list(
    search = list(regex = TRUE, caseInsensitive = TRUE),
    pageLength = 20,
    lengthMenu = c(25, 50, 100),
    autowidth = TRUE,
    columnDefs = list(list(width = "80%", targets = list(2)))
  ))

Here's a screenshot of the resulting table:

[Screenshot: An interactive table of my Mastodon posts, filtered for the #rstats tag in a search box. The table was created with the DT R package using rtoot. Image: Sharon Machlis]

How to pull in new Mastodon posts

It's easy to update your data to pull in new posts, because the get_account_statuses() function includes a since_id argument. To start, find the maximum ID in the existing data:


max_id <- max(smach_statuses$id)

Next, request an update with all the posts since that max_id:


new_statuses <- get_account_statuses(my_id, since_id = max_id, 
                                     limit = 10, verbose = TRUE)
all_statuses <- bind_rows(new_statuses, smach_statuses)

If you want to see updated engagement metrics for recent posts already in your existing data, I'd suggest getting the last 10 or 20 posts overall instead of using since_id. You can then combine that with the existing data and dedupe by keeping the first occurrence of each ID. Here is one way to do that:


new_statuses <- get_account_statuses(my_id, limit = 25, verbose = TRUE)
all_statuses <- bind_rows(new_statuses, smach_statuses) |>
distinct(id, .keep_all = TRUE)

How to read your downloaded Mastodon archive

There's another way to get all your posts, which is especially useful if you've been on Mastodon for some time and have a lot of activity over that period. You can download your Mastodon archive from the website.

In the Mastodon web interface, click the little gear icon above the left column for Settings, then Import and export > Data export. You should see an option to download an archive of your posts and media. You can only request an archive once every seven days, though, and it will not include any engagement metrics.

Once you download the archive, you can unpack it manually or, as I prefer, use the archive package (available on CRAN) to extract the files. I'll also load the jsonlite, stringr, and tidyr packages before extracting data from the archive:


library(archive)
library(jsonlite)
library(stringr)
library(tidyr)
archive_extract("name-of-your-archive-file.tar.gz")

Next, you'll want to look at the orderedItems in outbox.json. Here's how I imported that into R:


my_outbox <- fromJSON("outbox.json")[["orderedItems"]]
my_posts <- my_outbox |>
unnest_wider(object, names_sep = "_")

From there, I created a data set for a searchable table similar to the one from the rtoot results. The archive includes all activity, such as favoriting another post, which is why I filter both for type Create and to check that object_content has a value. As before, I add a clickable >> URL to the post content and tweak how the dates are displayed:


search_table_data <- my_posts |>
  filter(type == "Create") |>
  filter(!is.na(object_content)) |>
  mutate(
    url = paste0("<a target=\"_blank\" href=\"", object_url, "\"><strong> >></strong></a>")
  ) |>
  rename(CreatedAt = published, Post = object_content) |>
  mutate(CreatedAt = str_replace_all(CreatedAt, "T", " "),
         CreatedAt = str_replace_all(CreatedAt, "Z", " "),
         Post = str_replace(Post, "</p>$", " "),
         Post = paste0(Post, "&nbsp;&nbsp;", url, "</p>")
  ) |>
  select(CreatedAt, Post) |>
  arrange(desc(CreatedAt))

Then it's another easy single function call to make a searchable table with DT:


datatable(search_table_data, rownames = FALSE, escape = FALSE, 
          filter = "top", options = list(search = list(regex = TRUE)))

That's handy for your own use, but I wouldn't share archive results publicly, since it's less obvious which of those posts might have been private messages (you'd have to do some filtering on the to column).
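If you did want to isolate only the clearly public posts, one approach is to keep rows whose to column includes the ActivityPub public collection URI. This is a sketch, assuming the to list-column of recipient URIs that appears in Mastodon's outbox.json export, applied to my_posts from the unnest step above:

```r
library(dplyr)

# Keep only archive posts addressed to the ActivityPub "Public" collection.
# `to` is assumed to be a list-column of recipient URIs.
public_uri <- "https://www.w3.org/ns/activitystreams#Public"
public_posts <- my_posts |>
  filter(vapply(to, function(x) public_uri %in% unlist(x), logical(1)))
```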

If you have any questions or comments about this article, you can find me on Mastodon at smach@fosstodon.org as well as occasionally still on Twitter at @sharon000 (although I'm not sure for how much longer). I'm also on LinkedIn.

For more R tips, head to InfoWorld's Do More With R page.

Copyright © 2022 IDG Communications, Inc.


