ok, experimenting on a massive test data set of over 5 million posts… this PostgreSQL works pretty well
SELECTCOUNT(ranked_recency.*) AS post_row_count
FROM
(
SELECT id, community_id, published,
rank() OVER (
PARTITIONBY community_id
ORDERBY published DESC, id DESC
)
FROM post_aggregates) ranked_recency
WHERE rank <=1000
;
This limits any one community to 1000 posts, picking the most recent created posts. This gives a way to age out older data in very active communities without removing any posts at all for small communities.
I don’t understand what you are trying to do here. What problem are you trying to solve? Is something wrong with having an index on that rank function, then just using a simple WHERE? Why do you want to limit to 1000 posts? If I want to look through 1000 posts in a community, I probably want to look at more than 1000.
Reproducible regular server crashes from queries taking tens of seconds long because the whole logic is based on no WHERE clause that has any meat to it. The server overloads in the field have been going on every single day that I’ve been here testing the big servers since May 2023.
If I want to look through 1000 posts in a community, I probably want to look at more than 1000.
I’m well aware of the push back. Everyone chimes in saying they want counting to be real time, the developers seem to avoid caching at all cost, and out of desperation - I’m trying to build some kind of basic sanity logic into the system so it doesn’t plow through 5 million rows to do a LIMIT 10 query.
Right now Lemmy works perfectly fine with no personalization. Anonymous users - it works great. If you want to read a million posts, it works great. Start blocking specific users, start adding in NSFW filters, cherry-picking a blend of communities, etc. and the problems show up. The ORM logic is difficult to follow, based on massive JOIN of every field there is in many tables, and at certain data thresholds with per-account preferences engaged - it goes off the rails into the pile of over 1 million posts (taking 40 seconds to list page = 1 of LIMIT 20 posts for even a single community).
The programmers who built the code for over 4 years don’t seem to think it is an urgent problem. So I’m chipping in. I personally have never worked with this ORM and I find it painful compared to the hand-crafted SQL I’ve done on major projects. I’m doing this because I feel like nobody else has for months.
Hmm, thanks. I don’t know if this discussion is helpful to you (if not, I’ll drop out and stay out of your way) but I feel like something is fundamentally broken if any sane user actions result in table scans like that, and decreasing the size of the scan with a WHERE clause seems like a band-aid. Browsing a community should be SELECT whatever FROM threads WHERE sortfield > token ORDER BY sortfield LIMIT 100; or something along those lines, with maybe a JOIN to filter out posts from blocked users and that sort of thing. There are only a few possible sort fields and they should all be indexed so the above query should have no scans at all. You might have to denormalize the comment count (put it in a column in the threads table that gets updated when a new comment is posted) so that you can index it instead of sorting on COUNT of some joined select. There will usually not be a ton of threads being updated at onced, so pg’s built in caching should keep it all pretty fast.
Disclosure, I’m completely unfamiliar with the Lemmy code and don’t have any significant Postgres experience (I’ve mostly used MySQL and SQLite) so I may have table names and stuff wrong above. But, I think it’s important to keep in mind that Usenet servers of the 1980s handled far more traffic than any current Lemmy server, with hardware 1/1000th of the speed, so if things are slow on Lemmy it’s probably best to look for fundamental issues with the queries and schema.
I guess the presence of an ORM is an antipattern in its own right and maybe Lemmy’s devs should aim to get rid of it.
You might have to denormalize the comment count (put it in a column in the threads table
When browsing a community, the concern is ‘post’, not ‘comment’ or threads. So this doesn’t really come into play. If anything, Reddit/Lemmy style system is focused on the latest vote count, not the number of comments…
Again though, that sounds like something that can be indexed. And posts more than a few days old usually won’t receive new voted very often. I think Reddit may archive very old posts so they can’t receive be votes at all. Are these table scans only an issue when the person is trying to browse by best of all time?
For recent posts, yeah, a bit of buffering could help if votes are arriving very fast. That seems like an eventual good optimization. For now there is not enough traffic to need it, I’d expect.
Again though, that sounds like something that can be indexed.
I don’t get what you are suggesting. There are INDEX.
The problem being addressed is that there is no WHERE clause that actually limits the posts.
JOIN is done on a table without first eliminating rows… and that worked OK when there was only 50,000 posts in the database, but now that it is over 1 million rows - it is causing major performance problems.
Sorry about the slow response. What I mean is, suppose you have a column with 1000 integers (27,5,100,60,…). You want to print the top 10:
SELECT num FROM xyz ORDER BY num DESC LIMIT 10;
The db has to scan all 1000 rows to find the 10 biggest numbers. Now suppose instead there is an index on that column, i.e. a btree that lets you search for a value with very few operations, or traverse the list in order. Now the SELECT doesn’t have to examine all the rows. It only has to traverse 10 items from the index, starting at the large end. It does mean that UPDATE and INSERT operations for that columb become more expensive, since the index has to be updated too, but that too is less expensive than a table scan.
I’m saying that by having similar indexes on the possible sorting orders of read queries, you can likewise get rid of all the table scans. Does that make sense?
Similarly if you JOIN two indexed fields, that is like merging two sorted lists. The db can traverse both indexes in parallel to find the matching values. Db’s can be very clever about stuff like this. It helps though if you use EXPLAIN to make sure they are doing the right thing.
As the subject of the post says… it is JOIN behavior that’s the problem. The queries work perfectly fine when you ask for posts without doing JOIN to a bunch of empty tables.
An even less-intrusive approach is to not add any new field to existing tables. Establish a reference table say called include_range. There is already an ENUM value for each sort type, so include_range table with these columns: sort_type ENUM, lowest_id BigInt, highest_id BigInt
Run a variation of this to populate that table:
FROM
(
SELECT id, community_id, published,
rank() OVER (
PARTITIONBY community_id
ORDERBY published DESC, id DESC
)
FROM post_aggregates) ranked_recency
WHERE rank <=1000
Against every sort order, including OLD. Capture only two BigInt results: the MIN(id) and the MAX(id) - that will give a range over the whole table. Then every SELECT on post_aggregates / post table includes a WHERE id >= lowest_id AND id <= highest_id
That would put in a basic sanity check that ages-out content, and it would be right against the primary key!
A core design issue of either approach is that server operators can modify the building of this data without needing to modify or restart the lemmy_server Rust code.
Using a smallint also gives some flexibility (or a new field if going with the id min max approach)… if page greater than 10 for a particular sort, go to include > 1 and fall into tiers.
ok, experimenting on a massive test data set of over 5 million posts… this PostgreSQL works pretty well
SELECT COUNT(ranked_recency.*) AS post_row_count FROM ( SELECT id, community_id, published, rank() OVER ( PARTITION BY community_id ORDER BY published DESC, id DESC ) FROM post_aggregates) ranked_recency WHERE rank <= 1000 ;
This limits any one community to 1000 posts, picking the most recent created posts. This gives a way to age out older data in very active communities without removing any posts at all for small communities.
I don’t understand what you are trying to do here. What problem are you trying to solve? Is something wrong with having an index on that rank function, then just using a simple WHERE? Why do you want to limit to 1000 posts? If I want to look through 1000 posts in a community, I probably want to look at more than 1000.
Reproducible regular server crashes from queries taking tens of seconds long because the whole logic is based on no WHERE clause that has any meat to it. The server overloads in the field have been going on every single day that I’ve been here testing the big servers since May 2023.
I’m well aware of the push back. Everyone chimes in saying they want counting to be real time, the developers seem to avoid caching at all cost, and out of desperation - I’m trying to build some kind of basic sanity logic into the system so it doesn’t plow through 5 million rows to do a LIMIT 10 query.
Right now Lemmy works perfectly fine with no personalization. Anonymous users - it works great. If you want to read a million posts, it works great. Start blocking specific users, start adding in NSFW filters, cherry-picking a blend of communities, etc. and the problems show up. The ORM logic is difficult to follow, based on massive JOIN of every field there is in many tables, and at certain data thresholds with per-account preferences engaged - it goes off the rails into the pile of over 1 million posts (taking 40 seconds to list page = 1 of LIMIT 20 posts for even a single community).
The programmers who built the code for over 4 years don’t seem to think it is an urgent problem. So I’m chipping in. I personally have never worked with this ORM and I find it painful compared to the hand-crafted SQL I’ve done on major projects. I’m doing this because I feel like nobody else has for months.
Hmm, thanks. I don’t know if this discussion is helpful to you (if not, I’ll drop out and stay out of your way) but I feel like something is fundamentally broken if any sane user actions result in table scans like that, and decreasing the size of the scan with a WHERE clause seems like a band-aid. Browsing a community should be SELECT whatever FROM threads WHERE sortfield > token ORDER BY sortfield LIMIT 100; or something along those lines, with maybe a JOIN to filter out posts from blocked users and that sort of thing. There are only a few possible sort fields and they should all be indexed so the above query should have no scans at all. You might have to denormalize the comment count (put it in a column in the threads table that gets updated when a new comment is posted) so that you can index it instead of sorting on COUNT of some joined select. There will usually not be a ton of threads being updated at onced, so pg’s built in caching should keep it all pretty fast.
Disclosure, I’m completely unfamiliar with the Lemmy code and don’t have any significant Postgres experience (I’ve mostly used MySQL and SQLite) so I may have table names and stuff wrong above. But, I think it’s important to keep in mind that Usenet servers of the 1980s handled far more traffic than any current Lemmy server, with hardware 1/1000th of the speed, so if things are slow on Lemmy it’s probably best to look for fundamental issues with the queries and schema.
I guess the presence of an ORM is an antipattern in its own right and maybe Lemmy’s devs should aim to get rid of it.
When browsing a community, the concern is ‘post’, not ‘comment’ or threads. So this doesn’t really come into play. If anything, Reddit/Lemmy style system is focused on the latest vote count, not the number of comments…
Again though, that sounds like something that can be indexed. And posts more than a few days old usually won’t receive new voted very often. I think Reddit may archive very old posts so they can’t receive be votes at all. Are these table scans only an issue when the person is trying to browse by best of all time?
For recent posts, yeah, a bit of buffering could help if votes are arriving very fast. That seems like an eventual good optimization. For now there is not enough traffic to need it, I’d expect.
I don’t get what you are suggesting. There are INDEX.
The problem being addressed is that there is no WHERE clause that actually limits the posts.
JOIN is done on a table without first eliminating rows… and that worked OK when there was only 50,000 posts in the database, but now that it is over 1 million rows - it is causing major performance problems.
Sorry about the slow response. What I mean is, suppose you have a column with 1000 integers (27,5,100,60,…). You want to print the top 10:
The db has to scan all 1000 rows to find the 10 biggest numbers. Now suppose instead there is an index on that column, i.e. a btree that lets you search for a value with very few operations, or traverse the list in order. Now the SELECT doesn’t have to examine all the rows. It only has to traverse 10 items from the index, starting at the large end. It does mean that UPDATE and INSERT operations for that columb become more expensive, since the index has to be updated too, but that too is less expensive than a table scan.
I’m saying that by having similar indexes on the possible sorting orders of read queries, you can likewise get rid of all the table scans. Does that make sense?
Similarly if you JOIN two indexed fields, that is like merging two sorted lists. The db can traverse both indexes in parallel to find the matching values. Db’s can be very clever about stuff like this. It helps though if you use EXPLAIN to make sure they are doing the right thing.
It makes sense, but there are indexes.
As the subject of the post says… it is JOIN behavior that’s the problem. The queries work perfectly fine when you ask for posts without doing JOIN to a bunch of empty tables.
3 hours later… I put it into code and am experimenting with it. Some proof of concept results: https://github.com/LemmyNet/lemmy/files/12373819/auto_explain_list_post_community_0_18_4_dullbananas_with_inclusion_run0a.txt
An even less-intrusive approach is to not add any new field to existing tables. Establish a reference table say called include_range. There is already an ENUM value for each sort type, so include_range table with these columns:
sort_type ENUM, lowest_id BigInt, highest_id BigInt
Run a variation of this to populate that table:
FROM ( SELECT id, community_id, published, rank() OVER ( PARTITION BY community_id ORDER BY published DESC, id DESC ) FROM post_aggregates) ranked_recency WHERE rank <= 1000
Against every sort order, including OLD. Capture only two BigInt results: the MIN(id) and the MAX(id) - that will give a range over the whole table. Then every SELECT on post_aggregates / post table includes a WHERE id >= lowest_id AND id <= highest_id
That would put in a basic sanity check that ages-out content, and it would be right against the primary key!
A core design issue of either approach is that server operators can modify the building of this data without needing to modify or restart the lemmy_server Rust code.
Using a smallint also gives some flexibility (or a new field if going with the id min max approach)… if page greater than 10 for a particular sort, go to include > 1 and fall into tiers.