MLS Send/Receive Changes (Sync/Synchronize) very slow due to inefficient disk activity

dajoker · #1

Lenovo ThinkCentre
3.46 GHz processor, 10 cores reported
8 GiB RAM
232 GiB HDD with 190+ GiB free still; disk defragment process recently completed showing minimal fragmentation
MLS data on disk: 190 MiB
Internet connection: 100+ Mbit up and down

The specs above are not terrible on their own, particularly considering the Internet connection (thanks Google Fiber). For most things the computer acts like a slow Windows (pardon the redundancy) computer, but the problem surfaces when we synchronize for the first time after a break. That break might be a week, or it might be some shorter period of time, but long enough that there is apparently a bit of data to be sent to MLS. That's fine, expected even, but the way MLS writes the data to disk is very troubling.

How long should synchronizing take on an otherwise okay system with a fast Internet connection? If I were to download 190 MiB on this connection it could in theory be done in a mere 15.2 seconds. To test this, I just downloaded Java 13 for Linux, which came in at 180 MiB, and it took almost exactly thirty (30) seconds, which accounts for all kinds of delays as the file came from somewhere else in the country. Not too bad, testing the network, processor, RAM, disk, antivirus software, and all other components outside MLS, and since I have a total MLS data size just slightly bigger than that, it would be fine with me if synchronizing took that long.
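
The 15.2-second estimate can be checked with a quick calculation. This is only a sketch: the function name is made up for illustration, and it assumes an ideal, fully used 100 Mbit/s link with no protocol overhead. Treating the data as 190 MB (decimal) gives the figure quoted above; treating it as 190 MiB (binary) comes out slightly higher.

```python
def transfer_seconds(size_bytes: int, link_mbit_per_s: float) -> float:
    """Ideal time to move size_bytes over a link, ignoring all overhead."""
    bits = size_bytes * 8
    return bits / (link_mbit_per_s * 1_000_000)


# Same payload, counted two ways (decimal megabytes vs. binary mebibytes).
decimal_estimate = transfer_seconds(190 * 1_000_000, 100)
binary_estimate = transfer_seconds(190 * 1024 * 1024, 100)

print(f"{decimal_estimate:.1f} s treating the data as 190 MB")
print(f"{binary_estimate:.1f} s treating the data as 190 MiB")
```

Either way the answer is well under a minute, which is the point of the comparison.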

How long should my download and write to disk take, though? I probably cannot get a single nice tarball from Church HQ with the stake data, so instead there might be a need to request a dozen wards, one at a time, to combine to make up the stake. That may also not be how the APIs work, and instead MLS might actually be querying for a big list of users and then pulling down their individual and household records one at a time, meaning there are now a few thousand queries to do; that, too, would be fine with me, and it might even be how the whole thing works. Still, that number of queries, and the total volume of data, should result in a couple minutes at most, or at least that seems reasonable to me. As another data point, I am able to back up MLS to disk in about ten (10) to fifteen (15) seconds, with an actual file write time of just four (4) seconds, which is fine with me considering the backup size is 80 MiB today (granted, there is probably some caching going on that delays writes and makes things seem to go more quickly than they actually do, but that's expected).

The actual sync time today was fifty-three (53) minutes, enough time for this same Internet connection to download thirty-nine (39) GiB of data, or about ten (10) full DVDs. Sadly, this is not unusual for MLS now, though it was not the case a year or so ago. During this synchronization the Internet connection was not bogged down or even pushed at all, so I do not think MLS is doing anything terribly inefficient on the wire, and as mentioned before the system as a whole (and even MLS's own backup) can write the total data size in seconds, so it's not the MLS-running system, or something inherently problematic with MLS as a whole.

Because troubleshooting is what I do, I decided to look at Task Manager and quickly saw that the disk utilization was really high, but read and write speeds were really low. During a sync I expect disk usage, but not for this time period, and not for this small quantity of data, so that was suspect. Of course finding out the details on Windows is hard because Task Manager is not a power tool, so I added Process Monitor (ProcMon, from Sysinternals) and quickly found the problem: low disk buffering, and a lot of random (vs. sequential) file accesses.

For those who are not IT people by training, "buffering" is what computers do in many contexts, not just when watching General Conference or listening to hymns online. Processes buffer data to be written to, or read from, disk in chunks for more efficient writes and reads. Perhaps an analogy can help: when drying dishes and putting them into a drawer you can do so one spoon, one fork, and one knife at a time, and it will require a lot of overhead (moving your arm back and forth, using your eyes to locate things to move, and places to put them, etc.). Alternatively, if you plan on putting all the forks in the same spot, you can probably pick them all up, move your arm and eyes once, and drop them into that slot in the drawer. Programs do the same thing, though with data they write blocks of data, usually thousands or millions of them, at a time rather than writing them one by one. For a lot of technical reasons, doing them one-by-one on a hard drive, especially a spinning rust (vs. solid state) drive, is terribly slow, and that brings us to the root cause here.

Adding random file reads and writes to that exacerbates a bad situation. Back to the silverware analogy, it would be easier for you to do each fork individually, then each spoon individually, then each knife individually, than it would be to do one fork, one knife, one fork, one spoon, one knife, etc., because you cannot keep using the same target location over and over. Computers work in a similar way, which is why disks report "seek times" as a performance measure. Non-solid-state drives suffer from slower seek times than solid state drives, so having small reads, randomly, is the best way to have poor performance (in fact, random reads and writes are a benchmarking method for this very reason, to see a worst case scenario).
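
The cost of random access can be illustrated with a toy model. The sketch below is not a benchmark and touches no real disk: it simply adds up how far an imaginary drive head would travel when visiting the same set of block offsets in order versus in a shuffled order. The function name, block size, and block count are all assumptions chosen for illustration.

```python
import random


def head_travel(offsets):
    """Total distance a (toy) disk head moves visiting offsets in order."""
    return sum(abs(b - a) for a, b in zip(offsets, offsets[1:]))


BLOCK = 4096  # assumed block size in bytes
sequential = [i * BLOCK for i in range(10_000)]

shuffled = sequential[:]
random.Random(42).shuffle(shuffled)  # fixed seed so runs are repeatable

print("sequential travel:", head_travel(sequential))
print("random travel:    ", head_travel(shuffled))
```

The same blocks are read in both cases; only the order changes, and the shuffled order makes the head cover far more distance.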

MLS 3.9.2 (and some earlier versions), when synchronizing, is writing data coming off the wire to disk in very small blocks, but also reading from the existing files in small blocks, and in a non-sequential way. Comparing MLS writes to a backup (as shown via ProcMon) with those taking place during a Send/Receive (sync) clearly shows the process writing in different ways, thousands of times more efficiently when doing the backup. A recent backup had about 3000 write events, with most of those writing 11 to 64 KiB of data at a time; not surprisingly multiplying those together gives something in the range of the total data size for MLS.
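
To show what buffering buys, here is a minimal sketch using Python's standard `io.BufferedWriter`. It says nothing about how MLS itself is written; `CountingSink` is a made-up stand-in for a file that throws the data away and only counts how many write requests reach it. The same 100,000 single-byte writes are issued twice, once directly and once through a 64 KiB buffer.

```python
import io


class CountingSink(io.RawIOBase):
    """Stand-in for a file: discards data but counts write requests."""

    def __init__(self):
        super().__init__()
        self.calls = 0

    def writable(self):
        return True

    def write(self, data):
        self.calls += 1
        return len(data)


PAYLOAD = 100_000  # bytes, written one at a time by the "application"

# Every one-byte write goes straight to the sink.
unbuffered = CountingSink()
for _ in range(PAYLOAD):
    unbuffered.write(b"x")

# The same writes, collected into 64 KiB chunks before reaching the sink.
buffered_sink = CountingSink()
writer = io.BufferedWriter(buffered_sink, buffer_size=64 * 1024)
for _ in range(PAYLOAD):
    writer.write(b"x")
writer.flush()

print("unbuffered write requests:", unbuffered.calls)
print("buffered write requests:  ", buffered_sink.calls)
```

The application code is identical in both loops; only the layer between it and the "disk" differs, which is why a fix of this kind can be cheap when the cause really is missing buffering.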

Ironically, years ago (six (6) or so) I reported this same problem with the backup process as our backups started bloating well beyond the floppy disk size (when financial reports stopped being printed and mailed), and the ProcMon output at the time was similar, but showing backups writing one byte at a time, and other processes running efficiently. Eventually, after a year or two (2), that problem went away, and backups went from taking nearly an hour to just taking a few seconds, as they do now.

To provide a few more pieces of data, I have some summary stats from ProcMon during different parts of MLS's operations:
Operation             Reads       Writes   Read Bytes    Write Bytes
Backing up            88,093      3,063    158,946,110   184,023,495
Updating membership   678,328     2,399    92,149,594    708,941
Updating finance      1,454,116   1,873    193,097,882   583,196

While this does not directly show the randomness of reads during synchronization, as the raw ProcMon data does, it hints at it through the average number of bytes per read and per write:

Backing up: 1,804 bytes/read and 60,079 bytes/write
Membership: 136 bytes/read and 296 bytes/write
Finance: 133 bytes/read and 311 bytes/write
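
The averages can be reproduced from the ProcMon counts given earlier in this post. A short sketch, with the counts copied from the summary stats and the function name invented for illustration; results are rounded to the nearest byte, so they can differ by one from truncated figures.

```python
# Counts copied from the ProcMon summary in this post:
# (reads, writes, read bytes, write bytes)
stats = {
    "Backing up": (88_093, 3_063, 158_946_110, 184_023_495),
    "Membership": (678_328, 2_399, 92_149_594, 708_941),
    "Finance": (1_454_116, 1_873, 193_097_882, 583_196),
}


def bytes_per_op(reads, writes, read_bytes, write_bytes):
    """Average bytes moved per read and per write, rounded to whole bytes."""
    return round(read_bytes / reads), round(write_bytes / writes)


for name, row in stats.items():
    per_read, per_write = bytes_per_op(*row)
    print(f"{name}: {per_read:,} bytes/read and {per_write:,} bytes/write")
```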

The vast majority of the reads and writes during synchronization are to just a few files:
C:\ProgramData\LDS Church\MLS\data\units\unitNumberHere\data\mls_tab1805.data
C:\ProgramData\LDS Church\MLS\data\units\unitNumberHere\data\mls_tab1330.data
C:\ProgramData\LDS Church\MLS\data\units\unitNumberHere\data\mls_tab1318.data
C:\ProgramData\LDS Church\MLS\data\units\unitNumberHere\data\mls_idx0.data
C:\ProgramData\LDS Church\MLS\data\units\unitNumberHere\data\mls_tab1452.data
C:\ProgramData\LDS Church\MLS\data\units\unitNumberHere\data\mls_tab1210.data

The backup of course has basically one big file being written, but it reads from all over the place to gather the data for the backup; since the read data are for a backup, the reads themselves are not random, as the point is to take all the data, not parts from here and there, and write it to one central file, so while there are a ton of reads, they get the job done sequentially and, therefore, quickly enough.

Going out on a limb, I am guessing that the sync (send/receive) process is synchronizing individual members, one at a time, and possibly breaking the data down from there too (one member's calling at a time, one member's phone at a time), and as part of this process it might be comparing with the old data to see what needs to be updated; done properly, this could save writing data already in place, since John Doe doesn't change addresses, phones, callings, or family members every day, or every week, and typically writes are more expensive (computing-wise) than reads. Unfortunately, if this is the case, looking for individual members' data, one at a time, is probably causing the random-looking disk reads, and corresponding tiny disk writes. Because there are many members in the stake, and many data points for each one, this results in millions of random-looking reads, a few writes, and general slowness. Welcome to IT, and optimizing software to scale properly.

There might be a lot of ways to fix this, but that assumes the root cause is as theorized, and it is probably more complex, or more nuanced at least, than that. Performing the data update in memory and writing all at once is one option that could work as long as stakes do not become 30x larger than they are today. Using different APIs that actually do synchronize all data in one structure, vs. making thousands of queries for individual pieces of data, may also work, but if the APIs do not exist then that could be a large effort. Throwing MLS out the window and getting all functionality into LCR would be great, but there are barriers there too, like the need to accommodate those without Internet connections, and the need to move a lot of functionality to LCR still.
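
The "update in memory, write all at once" option can be sketched as follows. This is purely illustrative: MLS's real storage layer is not visible from the outside, so `CountingStore`, its methods, and the sample records are all hypothetical. The point is only that both approaches end with the same data while issuing very different numbers of writes.

```python
class CountingStore:
    """Hypothetical stand-in for on-disk tables; counts write operations."""

    def __init__(self, records):
        self.records = dict(records)
        self.writes = 0

    def write_one(self, key, value):
        self.records[key] = value
        self.writes += 1

    def write_all(self, records):
        self.records = dict(records)
        self.writes += 1


def sync_one_at_a_time(store, updates):
    """Apply each update with its own small write."""
    for key, value in updates:
        store.write_one(key, value)


def sync_in_memory(store, updates):
    """Load once, apply every update in RAM, then write once."""
    working = dict(store.records)
    for key, value in updates:
        working[key] = value
    store.write_all(working)


initial = {"member-1": "old phone", "member-2": "old address"}
updates = [
    ("member-1", "new phone"),
    ("member-2", "new address"),
    ("member-3", "new record"),
]

store_a = CountingStore(initial)
store_b = CountingStore(initial)
sync_one_at_a_time(store_a, updates)
sync_in_memory(store_b, updates)

print("one at a time:", store_a.writes, "writes")
print("in memory:    ", store_b.writes, "write")
```

The trade-off is memory: the whole data set has to fit in RAM, which is comfortable at roughly 190 MiB on an 8 GiB machine.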

If I can help provide more information I'm happy to do so. For now, this process is causing a lot of wasted time for church leaders in our stake, to the point where I know a few have started their MLS send/receive before church so they have it done when they return to the computer after meetings are complete.

Aaron Burgemeister
Stake Assistant Clerk - Finances

#2

Since this is a user-to-user forum that the developers rarely frequent, you may wish to share your suggestions through one of the feedback pages at the Church website, which the developers will see, or through an MLS message.

russellhltn · #3

When talking to church HQ, rather than focus on the inefficiency, I'd focus on the time (nearly an hour to sync).

As an aside, I'll mention a couple of things. I find that some clerks don't allow "RTPatch" to run when starting MLS. That's how MLS is patched.

Secondly, you might try downloading MLS and doing an "over the top" install. That makes sure you have everything current - especially the Java install that MLS uses (it uses its own separate copy, not anything installed by the user).

If that doesn't fix it, then I'd complain.

It's just I think I'd have seen more posts about 1 hour sync times if this was common.

#4

If it were me, I'd probably slap a $20 120GB SSD in there and transfer all the data to it. (OK, I've been known to bend some of the policies on occasion, but your time has to be worth something!)

scgallafent · #5

If you've got syncs taking 53 minutes, something is broken. If it were my ward, I would completely remove MLS and reinstall to make sure you have the most current version of MLS (someone may have skipped some updates) and a brand new database. Everything except your local budget allowances (by organizations) should be restored from headquarters. This would best be done when the GSC is available in case you need a token reset.

dajoker wrote:Going out on a limb, I am guessing that the sync (send/receive) process is synchronizing individual members, one at a time, and possibly breaking the data down from there too (one member's calling at a time, one member's phone at a time), and as part of this process it might be comparing with the old data to see what needs to be updated

MLS is essentially managing a replicated slice of the membership database. In order to do that, it is receiving transactions from the master database with each change that was made to the data and applying those transactions to the database. Temple recommend activated? That's a transaction. Phone number changed? That's another transaction. Each of those requires updates to the database tables, to the indexes that manage those tables, and potentially to log files that manage the transactions.

Databases have data scattered in different blocks on disk. There are techniques that can be used to reduce that by structuring the data in certain ways, but database updates follow a general pattern of small data writes to random locations.

A stake is going to have more updates because you are receiving transactions with record updates for every member of the stake. There is also a process once a quarter where you get a full database refresh to help address situations where a transaction might not have been processed.

dpabst · #6

We have a Solid State Drive (SSD) and one of the latest mini ThinkCentre models, and it just took 35 minutes to synchronize/transmit MLS.
I'm the STS and Stake Finance Clerk. It's painful to wait that long just to do my calling. I avoid MLS now as much as I can, since I can do expenses on LCR (except for deposits, but those are less common at the stake level), but I'm still required to print finance statements for the stake president to sign.

Will starting fresh make any difference? Is there a database clean-up tool that might make it faster? It makes me feel like I'm on dial-up. In Dialer options, it's set to "Use Internet". I get 30 Mbps download and 5 Mbps upload on a speed test, so it's clearly not my internet speed, the computer, or the disk. It doesn't matter when I do it either, on a Sunday or a Tuesday (today).

If I could just walk away and synchronize, I'd do that, but then someone else would have access to more than they should if they happened to have computer access since I'd be still signed into MLS.

Maybe I just need to read 2 Peter 1:6 a few more times.

eblood66 · #7

dpabst wrote:but I'm still required to print finance statements for the stake president to sign.

I don't know whether there's anything you can do about the rest but you should be able to print finance statements from LCR using Finance > Reports > Financial Statements. At least the ward statements are there and I would assume the stake ones are too. The only thing I use MLS for now is donation batches.

#8

dpabst wrote:I'm the STS and Stake Finance Clerk. It's painful to wait that long just to do my calling.

I would read the post just before yours by scgallafent and decide if it would be worth reloading MLS. I probably would do it.

scgallafent · #9

dpabst wrote:We have a Solid State Drive (SSD) and one of the latest mini ThinkCentre models, and it just took 35 minutes to synchronize/transmit MLS.
I'm the STS and Stake Finance Clerk. It's painful to wait that long just to do my calling. I avoid MLS now as much as I can, since I can do expenses on LCR (except for deposits, but those are less common at the stake level), but I'm still required to print finance statements for the stake president to sign.

Will starting fresh make any difference? Is there a database clean-up tool that might make it faster? It makes me feel like I'm on dial-up. In Dialer options, it's set to "Use Internet". I get 30 Mbps download and 5 Mbps upload on a speed test, so it's clearly not my internet speed, the computer, or the disk. It doesn't matter when I do it either, on a Sunday or a Tuesday (today).

If I could just walk away and synchronize, I'd do that, but then someone else would have access to more than they should if they happened to have computer access since I'd be still signed into MLS.

Starting fresh will help a little, but I'm not sure how much of a difference you will see. The biggest issue is how frequently you are transmitting with MLS.

I looked at your logs to see what is going on. I'm going to cover some of what happens during the send/receive cycle and then address your specific situation.

MLS is an exercise in keeping distributed databases synchronized. The primary databases are the member and leader databases at Church headquarters. The local MLS database on your stake computer and the MLS databases on the computers in each of your ward clerks' offices are also involved in this.

When a change is made to a membership record, the information about that change is distributed to each of the computers involved in that cluster of databases. For example, if a new family moves into a ward, the ward clerk or bishop will normally complete the member move in Leader and Clerk Resources. The move triggers a whole series of actions in the central database (updating membership records with the new ward information, updating the address for the household, recording the move out of the previous ward) that need to be distributed to the individual MLS databases. As the process progresses, a series of "transactions" are recorded on the associated member records in the headquarters database.

Those transactions result in a series of change messages destined for the local MLS database that tell the local databases what needs to be updated. Those are collected into a batch and staged for transmission to MLS. In the case of the member move above, there would be four batches of change records: move-out updates would go to the old ward and stake and move-in updates would go to the new ward and stake.

This happens for any kind of change to membership, callings, and finance data. These batches are staged for transmission as they are generated, so you gradually build up a pile of update batches between transmissions.

The accumulation of these update batches isn't generally a problem for wards because wards are dealing with a smaller number of members (fewer updates per week) and they generally transmit every week due to weekly donation processing. Looking back over my ward's logs for the last several months, it looks like our typical weekly transmission process takes between 60 and 90 seconds. That typically involves 5 to 10 membership update batches and 10 to 20 finance update batches. Once a quarter we send a full membership refresh to each ward to help ensure the databases stay in sync. That takes a little longer to process and our 60 to 90 second sequence typically runs between 3 and 4 minutes on those weeks.

Bandwidth doesn't make a big difference here. These are typically very small files (a few hundred bytes for most files), so transmission time is negligible. Each file is transmitted individually, so the cost of setting up and tearing down an individual file transfer dominates the time needed to transmit a file. On average, transferring a single update batch takes about a second, which is a function of the round-trip time for messages to get from MLS to the Church servers and back. You won't see much performance difference on a gigabit vs. 12 Mbps connection unless the ping times are significantly different.
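
A rough model makes the point about bandwidth concrete. This is a sketch under stated assumptions, not a description of the actual MLS transfer protocol: the function name is invented, the file size of 300 bytes stands in for "a few hundred bytes", the one-second per-file cost comes from the average mentioned above, and 485 files is used as a sample batch count.

```python
def batch_transfer_seconds(n_files, file_bytes, link_mbit_per_s, per_file_overhead_s):
    """Total time when each file pays a fixed setup/teardown cost plus payload time."""
    payload_s = (file_bytes * 8) / (link_mbit_per_s * 1_000_000)
    return n_files * (per_file_overhead_s + payload_s)


# Assumed for illustration: 485 files of 300 bytes, ~1 s of overhead each.
slow_link = batch_transfer_seconds(485, 300, 12, 1.0)     # 12 Mbps
fast_link = batch_transfer_seconds(485, 300, 1000, 1.0)   # gigabit

print(f"12 Mbps: {slow_link:.2f} s")
print(f"gigabit: {fast_link:.2f} s")
print(f"saved by the faster link: {slow_link - fast_link:.2f} s")
```

With files this small, the payload term is a rounding error next to the per-file overhead, so only a lower round-trip time would shorten the total noticeably.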

Now to look specifically at your stake's situation:

Your most recent send/receive cycles were on July 14, August 25, and October 8. It looks like someone is logging into MLS about once a week but you are only transmitting about every six weeks.

Your July 14 transmission cycle took 18 minutes and 1 second and processed 215 batch updates.

Your August 25 transmission cycle took 22 minutes and 27 seconds and processed 358 batch updates.

For your October 8 transmission cycle, the log upload that happens during the transmission cycle happened about 14 minutes into the process. I can't see the logs with the full detail until your next transmission cycle when the log will be uploaded with the rest of the detail from the October 8 transmission cycle. There were 224 calling update batches, 73 finance update batches, and 178 membership update batches. Accounting for a little bit of administrative overhead (log files, config files, check MLS version, etc.), you probably came in at around 485 files.

File transfer time typically runs just under a second, so you have about nine minutes total of file transfer activity. That is about 45 seconds to initially generate the list of batch files and then about 8 minutes of file transfer time. It looks like you had about 8.5 minutes of calling update processing (those typically run in about a second). The rest of the time would have been membership and finance updates.
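
The tally for the October 8 cycle can be checked in a few lines. The batch counts, the 485-file estimate, and the 45-second listing step come from this post; the one second per file is an assumption that rounds "just under a second" up.

```python
calling, finance, membership = 224, 73, 178  # batch counts from the post
batches = calling + finance + membership

files = 485              # the post's estimate, including overhead files
seconds_per_file = 1.0   # assumption: "just under a second", rounded up
listing_seconds = 45     # time to generate the list of batch files

transfer_minutes = files * seconds_per_file / 60
total_minutes = (listing_seconds + files * seconds_per_file) / 60

print(f"{batches} update batches, about {files - batches} overhead files")
print(f"file transfer: about {transfer_minutes:.1f} minutes")
print(f"listing plus transfer: about {total_minutes:.1f} minutes")
```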

Looking at what I see in your logs vs. my ward's non-SSD installation, it looks like the SSD is buying a little bit of performance, but not much. The file transfer time is dominated by the network round-trip time. The update times are a little faster with the SSD, but the jump isn't as big as I would have guessed.

The biggest driving factor in your case is the amount of activity that is building up during the six weeks between transmissions. Transmitting more frequently will reduce the size of the batches. That essentially spreads the process out into a bunch of smaller batches.

davesudweeks · #10

I found the above explanation very interesting. When I was first trained as a clerk on MLS, the instruction was "first thing you do every time you start MLS is transmit, and the last thing you do every time you shut down MLS is transmit, then backup." I would guess that if every ward and every stake would do this every time a clerk was in the clerk's office, it might help...
