Summarized using AI

A Decade of Rails Bug Fixes

Jean Boussier • September 11, 2024 • Sarajevo, Bosnia and Herzegovina • Talk

In the talk titled "A Decade of Rails Bug Fixes" presented by Jean Boussier at EuRuKo 2024, the speaker reflects on two significant bug fixes he has undertaken during his career—one from ten years ago and another from the past year. The presentation serves as an insightful exploration of debugging techniques and problem-solving approaches in the context of Ruby on Rails development.

Key Points Discussed:
- Introduction to the Speaker: Jean introduced himself, mentioning his contributions to Ruby and Rails and his experience at Shopify.

- First Bug Fix (10 Years Ago):

  • The bug involved a slow count query in a Rails application; the model already had an Active Record counter cache, but it had been deliberately disabled because the counts kept drifting out of sync.
  • What was initially perceived as an easy fix turned out to involve complex interactions with existing code and a long-standing issue that had resisted previous attempts.
  • Boussier emphasized the importance of understanding the context and history of code changes, using tools like git blame to trace back decisions.
  • He identified a race condition in the counter caches: concurrent deletions of the same record decremented the counter more than once, producing negative counts.
  • The solution involved refactoring how counter caches work within Active Record so that deletions only decrement the counter when a row was actually affected.

- Second Bug Fix (Recent Years):

  • Jean described tackling a bug in Mastodon involving partially initialized Active Record instances, which surfaced when installations upgraded from Ruby 3.1 to 3.2.
  • The bug emerged when cached objects were serialized and deserialized with Ruby's built-in Marshal, interacting badly with the new object shapes optimization.
  • He walked through the debugging steps, illustrating the challenge of reproducing an issue that did not occur in his own environment and the importance of experiencing bugs first-hand.
  • Boussier developed a workaround using a different serialization approach that also improved performance, and the underlying behavior was later fixed in both Rails and Ruby.

Conclusions and Takeaways:

- Debugging is fundamentally a scientific method involving hypothesis formation, experimentation, and validation of results.
- Clear reproduction of bugs allows for more effective solutions and is often critical in an open-source environment where collaboration is essential.
- The significance of strong communication and documentation in discussions with maintainers was underscored, highlighting that clarity can greatly facilitate the debugging process.
- Active engagement and a sense of ownership over codebases are necessary qualities for developers to foster effective solutions in collaborative settings.

In conclusion, Boussier emphasizes the need for continuous learning and adaptability within software development, especially when navigating complex systems like Rails. His experiences highlight how understanding core principles and maintaining open communication can lead to successful debugging and enhancement of code quality.

A Decade of Rails Bug Fixes
Jean Boussier • Sarajevo, Bosnia and Herzegovina • Talk

Date: September 11, 2024
Published: January 13, 2025

EuRuKo 2024

00:00:09.120 Hello everyone. So yeah, that's me, the one who was name-dropped in the previous talk a couple of
00:00:15.039 times. In this talk I'm going to present two bugs: one from about
00:00:22.920 10 years ago, which was my first real Rails contribution, and another one
00:00:28.679 from a bit more than a year ago, which was probably one of the most intricate bugs, if not the most intricate bug, I
00:00:35.559 ever had to fix. But first, let me introduce myself. I'm Jean Boussier, probably
00:00:41.360 better known as byroot. If you follow Ruby or Rails development you might also see me under my professional alter ego,
00:00:48.840 Casper Isfine. In addition to being part of Rails Core and Ruby Core, I also maintain a
00:00:56.039 number of popular gems such as Bootsnap, redis-client, MessagePack, a few
00:01:01.079 things like that. I probably don't need to present my employer; I'm sure you're all
00:01:06.240 streaming music on your phone.
00:01:13.920 So, as I said, the first bug was from 10 years ago. Like every bug, every contribution,
00:01:21.600 there's always a context, an opportunity. Back in the day, I joined
00:01:26.640 Shopify in October 2013, in their newly created office in
00:01:32.640 another town. We were growing very, very fast; we were assigned projects, but there were always more people than
00:01:39.000 projects, more people than work to do. So I tried to keep myself busy
00:01:45.040 looking at all the issues on the bug tracker and fixing weird stuff, and I made a bit of a reputation for
00:01:51.520 myself. Until one day my manager told me, "Hey Jean, I have something
00:01:56.960 for you." As you can see here, it was a MySQL slow query report: we
00:02:04.640 had a COUNT query taking 120 milliseconds. You might think,
00:02:11.000 okay, it's a read query so it's fine, it's from a background job so maybe it's fine, but it actually takes a lot of
00:02:16.879 database capacity: while your database does that, that's one MySQL thread that cannot do anything else, so it's not
00:02:23.400 something you want to keep around.
00:02:29.280 So, being a Rails developer, my first thought was: oh, that's easy. If you have slow count queries you use Active Record counter caches; that's
00:02:34.959 exactly what they're for. What are Active Record counter caches, you might ask? It's
00:02:40.640 a very simple denormalization technique: whenever you insert a record you increment a counter, whenever you
00:02:46.840 delete a record you decrement the counter, and that gives you, on the parent model, an
00:02:52.360 integer field that you can just read instead of doing an extra query. It's mostly beneficial if you
00:03:00.959 have a read-heavy use case; if you're inserting way more than you're counting, don't do it.
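For reference, a minimal sketch of the counter cache setup described here, with illustrative model names:

    class Collection < ApplicationRecord
      has_many :products
    end

    class Product < ApplicationRecord
      # Maintains collections.products_count on insert and delete
      belongs_to :collection, counter_cache: true
    end

    # Reading the cached value avoids an extra COUNT(*) query:
    collection = Collection.first
    collection.products_count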
00:03:07.840 Under the hood, Active
00:03:13.560 Record does atomic increments and decrements as part of the transaction that inserts or deletes the
00:03:20.040 record, so it's all atomic, it's all kept in sync, not for
00:03:26.200 free but very cheaply. However, when I went to track
00:03:31.720 down where the query was coming from, I found that the model actually already had a
00:03:38.400 counter cache, but for some reason the code wasn't using it. Using the good old
00:03:44.400 git blame, I saw that just a few months prior someone had purposely stopped
00:03:50.000 using the counter cache. So what seemed like a very simple fix started to become
00:03:55.680 more complicated. Tracking down that commit and the discussion prior to it, there
00:04:03.200 was a bug report about collections. It's Shopify, so you have products, they are part of collections,
00:04:09.200 and you might want to know how many products there are in a collection, and some people would see something like minus 500 products in a
00:04:15.959 collection, which obviously sounds a bit wrong. There had also been several
00:04:23.160 earlier attempts to move the counter caches into Redis, because Redis is atomic, but nothing had worked so far, and
00:04:32.440 it had been a long-standing bug for about a year, I think, something like
00:04:38.240 that. So I went to tackle the issue, and all the previous attempts
00:04:44.759 had assumed, okay, this must be a race condition somewhere in Rails, and they went nowhere. And
00:04:52.560 for me, counter caches had been in Rails for as far as I could remember, people had been using them for 10 years, it
00:04:59.199 couldn't possibly be a bug in Rails, right? This was battle tested. So I went instead
00:05:06.840 with the idea that we were probably doing something wrong, and I went to look at the implementation of
00:05:13.639 counter caches, which, don't get scared, I'm going to go over.
00:05:21.000 It's just metaprogramming to define model callbacks; if you remove the
00:05:27.080 metaprogramming layer you get something much simpler, something all of you should be able to follow:
00:05:33.520 on after_create you increment, on before_destroy you decrement. That's it, as simple as
00:05:41.360 that.
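A hedged sketch of what those desugared callbacks boil down to (not the actual Rails source):

    class Product < ApplicationRecord
      belongs_to :collection

      # Roughly what counter_cache: true expands to once the
      # metaprogramming layer is stripped away:
      after_create do
        Collection.increment_counter(:products_count, collection_id)
      end

      before_destroy do
        Collection.decrement_counter(:products_count, collection_id)
      end
    end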
00:05:47.840 So I basically audited the Shopify codebase: since those are callbacks, you could be using update_column or some lower-level APIs that bypass
00:05:55.479 callbacks. That was my theory, but I just couldn't prove it. I spent a couple of days looking everywhere for
00:06:03.039 proof of that but just couldn't find it. So, back to square
00:06:09.479 one. But then it hit me, looking at the production data,
00:06:14.759 that all the desynchronized counters were off in the same direction:
00:06:20.400 they were always displaying less than the actual count, and
00:06:26.000 many were actually negative, below zero. This already allowed me to narrow down the
00:06:33.319 problem: if we decrement too much, it means the issue is on the destroy path, not on the
00:06:40.199 create path. Well, actually, it could also be that we don't increment when we create,
00:06:45.360 but still, perhaps it was indeed a concurrency
00:06:51.960 issue. When you try to reproduce a bug on a huge app it can be complicated, because
00:06:58.919 you have lots of overhead, and recreating the scenario with all the validations can be complicated. So I just
00:07:05.199 created a small application, your basic blog with posts and
00:07:10.599 comments and whatever, and I ran it with Unicorn to have actual concurrency, because that's also a
00:07:17.000 problem when you try to reproduce in development: usually you have many things that are different from
00:07:23.599 production. And that's where I managed to actually reproduce the problem. As you can see,
00:07:29.160 the counter cache displayed on top says there's one product, but you can actually see two under it, and
00:07:34.759 all I did was mash the delete button in the
00:07:40.520 browser. As an anecdote, I tried to reproduce this 10 years later for this talk, and I just couldn't,
00:07:46.039 because Chrome and Firefox now prevent double clicking, so I had to simulate it a
00:07:53.879 bit. So now that I had a reproduction, I could really look in detail at what happened. Rather than
00:07:59.720 trying to get through gigabytes and gigabytes of production logs, I had just a few lines, and I colored
00:08:08.520 the relevant lines. You can see that the first request, which is in red,
00:08:17.039 first loads the parent record, and
00:08:22.720 then just after that the second request loads the same record in memory as well,
00:08:28.360 and then they both delete it. That's where the
00:08:35.800 problem comes from. And as with most concurrency issues, because the job of our database, with transactions
00:08:43.039 and everything, is to make everything appear as if it were a serial series of events, there's
00:08:49.120 concurrency of course, but at the end of the day you can build a linear series of events;
00:08:54.920 you don't need threads or anything to reproduce it. So here is that log translated
00:09:00.440 into a simple Ruby snippet: in the same process, same thread, you
00:09:05.720 load the same product twice and then you destroy it
00:09:10.880 twice.
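A minimal sketch of that reproduction, with illustrative model names:

    # Two in-memory copies of the same row, as two concurrent requests would have:
    product_a = Product.find(1)
    product_b = Product.find(1)

    product_a.destroy  # DELETE removes the row, counter is decremented
    product_b.destroy  # DELETE removes nothing, but the counter is decremented again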
00:09:16.920 I wasn't so great at SQL at the time; I had just been to school and they told me, oh, transactions are atomic, and so yeah, that's
00:09:23.160 it. But it turns out it's a bit more complicated than that. In SQL, DELETE
00:09:28.640 is an idempotent operation, so you can repeat it and it's just always going to succeed: it might delete something or it
00:09:34.399 might delete nothing, but in both cases it's considered a success. So if we always decrement after
00:09:43.040 a DELETE, assuming we actually deleted something, then we decrement too
00:09:50.720 much. So, same thing, it was very easy to create a test case to
00:09:56.720 reproduce this, because when you're trying to debug something, having a good feedback loop, a very
00:10:01.880 fast feedback loop, is a major productivity gain. And worst case, with this
00:10:08.160 test case, if I couldn't fix the bug myself I could just open an issue on Rails and say, hey, here's a
00:10:14.040 test case, you can experiment for yourself. So getting a repro, a
00:10:20.079 programmatic repro, is really the most important thing you should drive for when you're looking for a
00:10:26.200 bug. So now that I knew what the bug was, and as I said this had
00:10:31.880 been in Rails for 10 years, it was impossible that we were the first ones to hit it, so I went searching. Now I had a
00:10:37.720 keyword I could search with, and there were indeed some reports.
00:10:45.959 There were even some proposed fixes, some pull requests, and one of them
00:10:51.760 was to use a pessimistic lock, that is, database locking, to prevent this race
00:10:57.800 condition, but it was rejected, or at least not considered, because the
00:11:04.040 performance impact of locking explicitly during the transaction is really significant. So I knew I had
00:11:10.639 to find another solution. But reflecting back, if you look at
00:11:17.480 that MySQL console session,
00:11:22.639 the answer is right before our eyes: it tells us that the first DELETE actually deleted the record but the second
00:11:28.200 one didn't. So if we could get this information on the Ruby side, we could
00:11:35.519 just say: hey, if you deleted the record, you decrement; if you didn't, you do
00:11:41.560 nothing. And conveniently, we actually have that information. Maybe you have
00:11:48.880 sometimes used the lower-level Active Record APIs like update_all or delete_all; they
00:11:55.480 actually return you that number, the number of affected rows.
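For example, a hedged illustration of those lower-level APIs:

    # delete_all bypasses callbacks and returns the number of affected rows:
    Comment.where(post_id: 1).delete_all  # => 3  (three rows deleted)
    Comment.where(post_id: 1).delete_all  # => 0  (nothing left to delete, still "succeeds")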
00:12:01.800 So the data was there, we just couldn't use it, because since counter caches are callbacks, they don't get
00:12:08.399 this information; they run in a different context. So we had everything, it
00:12:14.440 was just a matter of rearranging things, putting all the pieces
00:12:20.600 together. So I went to refactor counter caches so that
00:12:27.760 they're no longer callbacks, they're actually core to Active Record, and this way they could just issue the
00:12:35.279 actual DELETE statement, get the value back, and check whether the affected row count
00:12:40.440 is indeed greater than zero.
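Conceptually, the fix boils down to something like this (a hedged sketch of the idea, not the actual Rails patch):

    # Given a loaded record, only touch the counter when the DELETE
    # actually removed a row:
    def destroy_with_accurate_counter(product)
      affected_rows = Product.where(id: product.id).delete_all
      if affected_rows > 0
        Collection.decrement_counter(:products_count, product.collection_id)
      end
    end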
00:12:46.560 But before submitting this patch to Rails, I wanted to test it in production, because maybe my fix wasn't sound, maybe it also had
00:12:52.399 performance issues, maybe there were other bugs; as we say, one train
00:12:59.560 can hide another. So I made a monkey-patch version of that fix and I shipped
00:13:05.480 it to production, and I did a data migration to reset all the counters and
00:13:13.240 everything. And after several weeks, I don't remember quite how long,
00:13:19.000 we couldn't notice any desynchronization whatsoever, so I was able to revert the
00:13:24.040 initial commit that had stopped using the counter cache, solving the slow count query in the meantime.
00:13:30.839 Now that I had my fix, I had to send it upstream, right? Because you want
00:13:36.440 other people to benefit from it, and you don't want to run a monkey patch forever. So I did a clean
00:13:44.279 PR and it felt really, really good.
00:13:50.040 Aaron Patterson ended up reviewing my pull request, but it didn't exactly go as I
00:13:55.839 imagined. Based on his responses, he didn't quite understand what the bug
00:14:01.600 was, which was probably my fault for not explaining it well. I was a bit
00:14:06.759 destabilized, so I answered very stupidly and told him his suggestion
00:14:11.880 was a terrible idea, "with all due respect". My English,
00:14:19.839 my English level and my communication skills in general, weren't great at the time. Now that I
00:14:25.560 maintain many projects I totally understand how that works: because I had been working on this for weeks, I
00:14:30.920 had all the context in my head, and I didn't do a good job of really putting that context down in the
00:14:38.120 issue. So yeah, sorry. It's
00:14:46.199 really about the fact that people cannot read your mind; it's really important to write down everything, all
00:14:52.120 the important information, and present it well. But lucky for me, Aaron is really
00:14:57.880 nice, so he just mocked me a bit, which is fine, but many maintainers would have just ignored me or
00:15:03.759 something. I'm pretty sure myself today would have ignored the me from back then. But
00:15:11.000 the PR was ultimately merged after that conversation. So, reflecting back on all of this,
00:15:18.839 a few takeaways I took for myself: debugging is just the scientific method.
00:15:25.320 You theorize, you do an experiment, you confirm or refute, and then you just cycle back until the problem
00:15:31.839 is narrowed down enough that you know where it is. When you get more
00:15:37.199 experience, when you really know the application, you sometimes come up with better theories and so you
00:15:43.399 might go faster, but the process is still the same. I touched on this a bit
00:15:51.000 earlier, but the most challenging part of debugging is to get a reproduction. I know very,
00:15:57.639 very few bugs that have a reproduction and haven't been fixed; however, I know a million bugs where the reproduction
00:16:04.560 steps are unclear or non-existent and they're just rotting there. As a corollary, it means the best
00:16:12.199 person to investigate a bug is the person who experiences it. If you just go on a bug tracker and say,
00:16:19.000 hey, I have a bug and it's unclear how to reproduce it, the maintainer, 99% of the
00:16:24.440 time, just cannot do anything for you. And if you go look at Rails,
00:16:29.560 you're going to find five- or six-year-old issues that are just a vague bug report with no reproduction; nobody
00:16:35.959 can fix that for you. Another thing is to always
00:16:41.560 question your assumptions. I lost a couple of days thinking, oh, the bug cannot be that, because... so
00:16:49.120 just be careful about that. Another thing is that your
00:16:54.800 dependencies are your code too. This is open source; it's the
00:17:00.240 same thing as, I don't know, a controller that was written by a colleague of yours, and maybe they're on
00:17:07.240 vacation. You're not going to wait two weeks or whatever; it's your code too, right?
00:17:14.319 You don't have to know it beforehand to use it, but if it doesn't work as you expect, you need to be able to figure it
00:17:22.559 out. And also, hopefully this is never the case, but you shouldn't
00:17:28.319 need to ask a manager "can I do a PR upstream?"; it's your application,
00:17:34.760 really. And finally, communication is important. We all know the people who
00:17:40.360 think they're Linus Torvalds or whatever and that they can just behave that way. As a
00:17:45.760 maintainer, I can work on your issue in my free time or I can do a million other
00:17:50.799 things. I'm not saying you should just be polite, because you can be extremely insulting without swearing,
00:17:58.679 or you can be very nice while swearing a lot; it's really about trying to cooperate and help, not being judgy,
00:18:05.640 and keeping your grumblings to yourself, that kind of thing. Make me want to
00:18:11.080 help you, or make the maintainer want to help you. I just need to take a quick drink.
00:18:24.960 So, second bug, 10 years
00:18:30.080 later.
00:18:36.760 Fast forward 10 years: I became a Ruby committer, I became a
00:18:41.840 Rails core member, still working at Shopify but in a team dedicated to working on Ruby
00:18:47.840 and Rails. So I was like, you know, no Ruby bug, no Rails bug
00:18:53.799 can scare me anymore. Or so I thought.
00:18:59.440 In March 2023 I saw a toot on Mastodon,
00:19:04.720 and someone said, hey, there's a bug in Mastodon — because Mastodon is written in Rails — and I had a bit of time to kill, and
00:19:11.360 I was like, oh, this person is asking nicely, so I'm going to go
00:19:17.679 look. So, classic, not a super clear bug report, but recently Aaron
00:19:26.880 Patterson had been making changes to Mastodon to allow
00:19:32.200 installs to upgrade: Mastodon was stuck on Ruby 3.1, Ruby 3.2 was new, and he
00:19:37.600 wanted to see the performance difference, so he unblocked the upgrade, and some users started to upgrade and started
00:19:44.000 hitting errors, actual
00:19:49.760 exceptions. This is the backtrace, and just reading it,
00:19:56.919 without even looking at the code, it's failing while trying to read the id attribute
00:20:03.280 of an Active Record instance. So to me, I was like, oh, this is simple. I was very
00:20:10.240 overconfident. I thought the Mastodon developers had simply run into a small limitation. I said,
00:20:17.200 oh, this is probably one of your models that doesn't have an actual id attribute, maybe it's just
00:20:24.000 a join table. I'm just going to fix that in Rails, it's no problem, you can just
00:20:29.480 work around it that way. And of course, 5 minutes later, while trying to make
00:20:35.240 the change in Rails, I was like, wait a minute, this doesn't add
00:20:40.679 up. I had ignored tons of clues in the thread and I had jumped to conclusions way too
00:20:46.760 quickly, because I was clearly overconfident. When I went to look at
00:20:51.960 the source code, the problem was not that the model doesn't
00:20:57.200 have an id column, but that we were trying to access the id of an instance
00:21:03.320 that didn't have its attribute values loaded yet. So it was a partially
00:21:08.640 initialized record, much weirder than a missing column. Another clue that I had been
00:21:15.880 totally ignoring is that it was only happening to people upgrading to 3.2, so the "Rails limitation" theory was totally off
00:21:23.559 track. So I had to come up with a theory that would include
00:21:29.440 the upgrade, the change between 3.1 and 3.2, as part of the
00:21:34.600 bug. Since the bug was triggered by Marshal trying to load the cache, I
00:21:39.799 went to look for the changes in Marshal, so a good old git log,
00:21:47.080 and looking at the commits there was basically nothing. And honestly, from experience, when you're starting to use
00:21:52.320 git log to try to find a bug, you're desperate; you have no idea what you're looking at. You never
00:21:57.640 find bugs that way, unless you know it's really between five commits or so; with one year's worth of changes, forget
00:22:04.799 it. So at this point I just didn't think I could go any further without a
00:22:09.919 reproduction, because here I was trying to debug something that wasn't happening to me. It's like trying to help your
00:22:16.440 parents over the phone to debug their internet box or something, it's just absolutely terrible. So if I
00:22:23.840 couldn't reproduce it myself, I couldn't go any further.
00:22:29.159 The next day another user posted an interesting observation: they tried to load the broken Marshal payload from
00:22:35.600 Ruby 3.1, so they took a cache payload from production and tried to load it in the old Ruby, and it was failing
00:22:41.799 the same way. So this was one more clue, right? I was
00:22:48.559 convinced the bug was in Marshal, but I didn't know if it was a bug in loading the Marshal payload or in serializing it,
00:22:55.080 and if a payload generated by 3.2 is still
00:23:00.360 broken in 3.1, then I know it's in the generation. So we refined the hypothesis
00:23:05.679 yet again. But I really needed to get my hands on a corrupted payload, because I
00:23:12.960 needed to inspect it. The problem is, the Mastodon cache contains lots of personal identifying data and stuff
00:23:21.039 like that, so you cannot just say "hey, send me this", because that may be private messages and the
00:23:27.080 like. So the next day someone shared a decompiled payload, basically
00:23:33.039 a dump of the payload, so I couldn't execute it, but at least I could see things.
00:23:38.400 They did it this way so that they could strip the personal information, and at least I had various clues about
00:23:45.200 what it looked like. And that led me to a new theory:
00:23:51.799 in the payload I could see there were many circular references. I'm going to touch a bit more on that, but basically
00:23:57.080 objects referencing each other, A referencing B and B referencing A; Marshal is able to handle
00:24:03.880 that. And so my theory was that an Active Record instance was being used as
00:24:10.559 a hash key, in a circular way, before it was fully instantiated. So
00:24:16.080 basically that would look like this pattern. Here, if you're not familiar with allocate:
00:24:22.679 when you do SomeClass.new in Ruby, it basically calls allocate and then calls initialize on the return value of
00:24:28.440 allocate. So here I'm allocating an object without initializing it, without calling
00:24:34.080 initialize, and if I use that object as a hash key, I get exactly the same error
00:24:40.000 users had in production. So that was a confirmation.
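A hedged sketch of that reproduction pattern, with an illustrative model name:

    # SomeClass.new is roughly SomeClass.allocate followed by #initialize.
    # Allocating without initializing leaves an Active Record instance
    # with no attributes loaded:
    record = Product.allocate

    # Using it as a Hash key calls record.hash, which reads the id attribute
    # and blows up with the same kind of error users saw in production:
    { record => :anything }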
00:24:46.200 So you need to understand a bit how Marshal works. It's not a magical thing where you call load and
00:24:52.600 then instantly your whole object graph has been recreated by Marshal. It reads a stream of
00:24:59.279 instructions on how to recreate the object graph, so it creates the objects one by one and then rebuilds the
00:25:05.520 graph, which means that at many points in time you have many partially initialized
00:25:13.919 objects. And because of this, in the case of circular references,
00:25:20.080 Marshal may have to insert a partially initialized instance as a hash key or something. That was purely a
00:25:27.760 theory at that point, but something like that could reproduce it, so I had a pattern that could explain it, and I needed to look
00:25:34.159 for it. Since I know Active Record relatively well, I had a few ideas of
00:25:41.440 where the thing could be, so I started grepping and things like that,
00:25:46.600 and one place where we use Active Record instances as hash keys is the
00:25:51.880 association cache: when you load products.comments or something, there's a hash with instances as
00:26:01.039 keys. But there was a missing piece, right?
00:26:06.520 Because, okay, the record was
00:26:11.640 partially instantiated, but that meant we would load an association cache
00:26:17.279 before we load the attributes, which made no sense, because if you create a new record and look at the order of the
00:26:25.120 attributes, of the instance variables, the attributes come first. So we also needed to find a
00:26:33.520 place where the ordering differs. But let me show you how ordering works; you probably
00:26:39.799 never asked yourself this question, but instance variables are ordered in
00:26:45.880 Ruby. On the surface it's quite simple: Ruby just preserves the order of
00:26:52.440 assignment. Here we have two classes; one defines @a then @b, the other
00:26:58.399 defines them in the reverse order, and when you ask for the instance variables, you get them in the order you set the
00:27:03.919 variables. But what if the same class can sometimes have different insertion
00:27:10.080 orders? In this case it's the same class, and in one instance we define @a
00:27:15.600 and then @b, and in the other we define @b and then @a, and you can see that
00:27:21.960 the first case is what we expect, but the second case is less clear,
00:27:28.000 right? Should you reuse the same order as the first instantiation, or do you get your own order because
00:27:33.799 you're a different instance? On the Ruby committers Slack
00:27:40.360 we have a very useful bot where you can give it a snippet of Ruby and it will run it over all the
00:27:46.480 versions of Ruby and show you the differences. If you ignore 1.8,
00:27:52.039 where this was returning strings, you can see that up to 3.1 the very first
00:27:59.519 insertion order seen by the class was preserved, and it was the same for every single other instance of the class, but
00:28:06.000 that changed in 3.2: every single instance has its own order.
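A hedged sketch of the behavior difference being described (class and variable names illustrative):

    class Point
      def initialize(a_first)
        if a_first
          @a = 1; @b = 2
        else
          @b = 2; @a = 1
        end
      end
    end

    Point.new(true).instance_variables   # => [:@a, :@b] on every version
    Point.new(false).instance_variables  # Ruby <= 3.1: [:@a, :@b] (the class reuses the first order it saw)
                                         # Ruby 3.2:    [:@b, :@a] (each instance keeps its own insertion order)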
00:28:13.960 So this was a smoking gun, because we have a difference between 3.1 and 3.2,
00:28:19.760 and the reason for that is object shapes. I work closely with two colleagues
00:28:24.960 of mine who were working on that; it's an optimization. So now we needed to find a pattern in
00:28:31.240 Active Record that could cause this, and I grepped for it, and I thought
00:28:39.320 I had found it, because I saw this initialize_dup: when you dup
00:28:45.720 a record, it resets the association cache and then calls super, so all the
00:28:50.760 attributes are set later. I was really, really excited, like "I found the bug", with so many exclamation
00:28:57.159 points. But actually I was
00:29:03.679 wrong: when you dup, the attributes are actually copied directly before the callback is called, so it just didn't
00:29:11.240 work. But then, thinking about it more,
00:29:20.519 what had happened is that I had been messing around in my console for a very long time and I had created a state that reproduced the issue, but
00:29:26.039 that wasn't it. The truth is, it's a bit complicated, but
00:29:32.000 shapes are an optimization, unless you define your attributes in such different orders every time that it
00:29:37.399 creates too many possibilities; when that happens, Ruby de-optimizes and says, okay, I'm just going to use a hash
00:29:43.480 for that, because otherwise it's too complicated, it makes a gigantic tree. So to reproduce this you had to
00:29:51.159 create so many different insertion orders on that particular class that it de-optimizes the
00:29:56.240 class.
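Roughly, an illustration of the kind of pattern that can push one class past the shape limit; the exact thresholds are an implementation detail of Ruby 3.2, so this is only a sketch:

    class Record; end

    ivars = (1..8).map { |i| :"@attr#{i}" }
    100.times do
      r = Record.new
      # Assign the same instance variables in a different order each time,
      # creating many distinct shapes for this one class:
      ivars.shuffle.each { |name| r.instance_variable_set(name, 0) }
    end
    # Past an implementation-defined limit, Ruby gives up on shapes for this
    # class and falls back to hash-based instance variable storage.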
00:30:02.360 So finally, after four or five days, I actually had my repro. A partially initialized object being used like that, it's just a
00:30:10.840 bug in Ruby, right? It's a bug in Ruby, not in Rails, we should fix it in Ruby; it had been working until then. But
00:30:17.360 Rails 7 was just new, and I couldn't say "oh, don't upgrade to Ruby 3.2
00:30:22.840 if you're using Rails 7, it just doesn't work", so we needed to fix it in Rails as well.
00:30:29.000 And in hindsight, the actual fix we did in Ruby
00:30:35.679 only shipped something like nine months later, so it would have been terrible not to fix it elsewhere. So I needed a quick workaround for
00:30:43.120 Mastodon users, I needed a fix in Ruby, and I needed a fix in Rails. At that point I was curious, because
00:30:50.440 at Shopify we had been on 3.2 for three or four months and we had never hit that bug, but that's because we don't use
00:30:58.159 Marshal; we basically banned Marshal for this sort of reason, and we use a library called Paquito.
00:31:06.919 So I thought, okay, I'm going to backport something from Paquito into Mastodon. Basically,
00:31:16.000 instead of just doing Marshal.dump, I replaced it with a custom serializer. I'm just going to show you very quickly how it
00:31:22.240 works: every record is basically an array with the class name and the list of attributes, and that's your
00:31:28.480 record; then we rebuild it cleanly from that, and we don't rely on Marshal restoring instance variables in whatever order.
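A hedged sketch of that idea (illustrative helper names, not the actual Paquito or Mastodon code):

    require "json"

    # Dump each record as plain data: [class name, attributes hash].
    # The container format (JSON here) is incidental; the point is that
    # nothing depends on Marshal restoring instance variables.
    def dump_record(record)
      JSON.dump([record.class.name, record.attributes])
    end

    def load_record(payload)
      class_name, attributes = JSON.parse(payload)
      class_name.constantize.instantiate(attributes)
    end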
00:31:39.799 As an added benefit, it was way faster: the payloads were half the size and the serialization time was also cut roughly in half. So I then
00:31:47.320 did pretty much the same thing for Rails itself. Of course the payload format is different, so there's a migration
00:31:53.000 step, you need to activate the new version and everything, but now it's
00:31:58.360 fixed for everyone in Rails. And I even did a kind of better version, because
00:32:05.200 objects can declare how they're serialized by Marshal, so it's still using Marshal, but it's not using the
00:32:11.279 automatic behavior; the object just says, oh, if you want to convert me to Marshal, convert me to this array with
00:32:16.559 three elements.
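For reference, the hooks involved look roughly like this (a simplified, hypothetical sketch, not the actual Rails implementation; the third element is illustrative):

    class Product < ApplicationRecord
      # Marshal calls these hooks instead of dumping instance variables
      # automatically, so we stay in control of the representation.
      def marshal_dump
        [self.class.name, attributes, new_record?]
      end

      def marshal_load(data)
        _class_name, attributes, _new_record = data
        # Rebuild the record explicitly from plain attributes.
        initialize(attributes)
      end
    end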
00:32:22.679 And similarly in Rails, that cut down the payload size and the deserialization time quite a bit. So now I needed
00:32:28.200 to fix it in Ruby. I mean, my colleague
00:32:33.240 broke it, so I told him to go fix it. The issue is that when Ruby de-optimizes
00:32:40.200 and says "stop using shapes, I'm going to use a hash", it was using an unordered hash; technically
00:32:46.600 it's an ordered hash, but not an order we care about, which is the insertion order. The patch is much more
00:32:53.039 complicated than this because it's C, so you need to fix things everywhere, but basically it changes the type:
00:32:58.720 instead of using an id_table it uses an st_table, which is exactly the same hash you're using when you call Hash.new
00:33:05.200 in Ruby, it's really the same thing. And the fix was then
00:33:10.480 backported in 3.2.3. So, just
00:33:17.880 to conclude on this one, some more takeaways. I really need to
00:33:24.880 double-check before posting, because this thread, if you go look it up, is all me saying "oh, I found it" and
00:33:31.000 two minutes later "oh no, I was wrong"; I made a bit of a fool of
00:33:36.320 myself. And earlier I said that when you get more experience and know
00:33:41.360 the system well, you might be able to take shortcuts; beware, because you might actually discard some information
00:33:48.840 because of that. Also, as I said before, it's way harder to debug something you're not
00:33:54.760 experiencing yourself. It would have been really helpful if, from day one, I had had a
00:33:59.880 corrupted payload; that would have saved three or four days of debugging. And to me that's the big
00:34:05.000 advantage of open source: you can directly debug it yourself, you're not just doing a
00:34:10.520 slow back and forth with other people. Be very careful with Marshal:
00:34:17.000 it can be convenient, but it's really, really dangerous. Honestly my
00:34:23.119 advice would be: if the thing that's going to deserialize what you serialized is the same program with the same version,
00:34:29.599 it's fine, but even across two versions of the same program it starts to be very, very dangerous, because maybe
00:34:35.240 it's referencing a constant you removed, things like that. So try to avoid it like the
00:34:41.440 plague. And, this one is partly on Ruby, but semi-deterministic behavior is
00:34:49.320 really the worst, because that is what made this so hard to reproduce. And
00:34:56.280 also, just because you try something in IRB and it works, and maybe it has worked
00:35:01.680 like that for five years, doesn't mean it will continue like that. If the documentation doesn't tell you that the
00:35:07.359 insertion order can be trusted, you shouldn't trust it. There's something called Hyrum's Law that says that
00:35:13.200 every single observable behavior of your program will end up being relied upon by one of its users, and that's exactly what was
00:35:21.240 happening. And yeah, it's never a Ruby bug; don't go out of this talk thinking "oh, the weird thing
00:35:28.119 I'm seeing must be a Ruby bug"; it never is, until it is. And yeah, the Ruby VM is
00:35:34.920 your code too, because your code is running on it; if it's buggy, your program is buggy. So just be curious, go dig into
00:35:42.520 how things work, just search into it. That's all I had, thank you very much.