00:00:05.200
Cool, all right, so hi,
00:00:12.880
everyone. First I just wanted to thank Matt and Andy and patlock for
00:00:18.520
hosting. I actually used to come to Montreal.rb all the time back in 2007, 2008,
00:00:25.439
and it's my first time coming back, which tells a bit about my age too. So I'm very
00:00:30.800
glad to see that Ruby is still alive in Montreal. And when Mark was the Thin
00:00:36.320
server, yeah, exactly, those years. Cool, so that's me. I'm the
00:00:44.120
co-founder and CTO of Circle Medical. I've been doing Ruby for a long time, since 2005. I'm
00:00:50.520
from Montreal, but I've been an expat in San Francisco since 2011. Most of my family is
00:00:59.039
here, so I come back regularly. So what Circle Medical does is
00:01:08.400
pretty important context for what I'm going to talk about tonight, which is an AI medical scribe. So I'm going to
00:01:15.400
start by giving a little bit of context on the problems we've been facing and why we went on that
00:01:22.320
journey. At Circle we're building the technology to make delightful, quality primary care
00:01:29.000
accessible to everyone. The thesis of Circle Medical starts from the iron triangle of healthcare,
00:01:37.240
which in traditional healthcare says that quality, cost and access are an
00:01:42.320
indivisible trade-off: if you want better quality you need to increase cost or lower access, if you want a lower cost
00:01:49.079
you need lower access and quality, and so on. And we fundamentally believe
00:01:54.640
that technology can break that iron triangle of healthcare.
00:02:01.280
So we've been working on this for eight years, so it's been quite a journey.
00:02:06.320
We started at Y... I just thought it was very interesting that the triangle is
00:02:11.760
very similar to the triangle of software engineering with cost, scope and time.
00:02:17.680
Yeah, yeah, similar. So yeah, we
00:02:23.879
started with that belief that we could break that triangle using technology, and we started Circle
00:02:30.599
Medical in San Francisco back in 2015. We went through Y Combinator, and
00:02:37.360
during the pandemic a big shift happened where a wave of telehealth came and telehealth
00:02:44.720
became accepted with doctors as well as payers, so insurance companies, and
00:02:50.720
we've been really riding that wave since then. Today we're seeing more
00:02:55.840
than 50,000 patients per month. 97% of them are virtual. We do have 23
00:03:04.159
physical clinics across the US where we see patients in person, but it's
00:03:10.120
a very small percentage. And then we have headquarters in San Francisco and in
00:03:16.720
Montreal. Our team is about 25 engineers, and we have a pretty bland Ruby on Rails stack with
00:03:24.040
React, Next.js, and we're building a native app for the patients as
00:03:31.680
well.
00:03:36.920
So why did we go on the journey to start building an AI
00:03:42.840
scribe? First, there are two big problems that have been around for a
00:03:49.239
very long time in the medical field. The first one is what is often
00:03:54.879
referred to as pajama time: it's the time that doctors spend documenting
00:04:00.799
their visits and sending orders at the end of the day, after they're done seeing their patients. And there are some
00:04:07.439
statistics that in the US doctors spend on average two to three hours at night
00:04:12.799
finishing their notes, and that is in part responsible for some
00:04:20.280
of the burnout: in 2022, 63% of physicians reported that they were
00:04:25.840
experiencing some kind of symptoms of burnout. So we put a lot of
00:04:31.160
administrative work on physicians when they should spend their time taking care
00:04:36.759
of their patients. And then the second problem
00:04:41.840
is a problem we started experiencing as we've been growing. The documentation
00:04:47.120
is very important for many reasons. First, for patient education: part of what goes in
00:04:52.639
the note is the plan for the patient, which goes to the patient. The clinical note exists for legal and
00:04:59.000
compliance reasons, for doctors to cover their butt, essentially. It's important for
00:05:04.880
continuity of care: if I see a patient and then refer them somewhere else, the other doctor needs to have the
00:05:10.520
story of the patient. It's also important for billing, especially in the US: all the billing is
00:05:18.240
done around what has been documented in the clinical note, and it really
00:05:23.520
impacts how much you're reimbursed. So as we've been growing the number of providers, today we're at
00:05:30.080
350 doctors, it's been really hard to keep the quality for all those things
00:05:36.080
consistent across all providers. So, a possible solution: a
00:05:43.960
scribe that listens to the patient-provider conversation and writes the documentation for you. It's actually not
00:05:49.720
a new idea. There are services where you can hire a human who sits next to the doctor and will type and
00:05:56.720
document. It's just very expensive when you need to have a human, it's also a
00:06:02.199
little odd to have a third person when you're talking with your doctor, and it's also very
00:06:12.120
imperfect. So when LLMs came around, it really seemed like a game changer to
00:06:18.440
be able to solve that problem with technology, for at least three reasons. First, LLMs
00:06:28.319
are really good at some NLP tasks like summarization and text generation, so taking a conversation, summarizing it and
00:06:35.759
then generating a clinical note in a certain format. GPT has actually been trained
00:06:41.919
on clinical notes. You can tell by prompting. I don't know whose medical
00:06:46.960
records, but it's definitely been trained on medical records. It also has a lot
00:06:53.120
of medical knowledge, which actually helps it be accurate and good in what
00:06:59.199
it suggests. GPT-4, without any specialization at all, has exceeded
00:07:07.360
the passing score for the US medical licensing exam by 20 points, so it
00:07:13.479
would pass the medical exam with pretty good
00:07:18.800
grades. And then why Ruby? Our stack is Ruby, we love Ruby, I love Ruby.
00:07:26.879
And then I believe that AI and
00:07:31.919
machine learning are now actually accessible to every developer. They should be in every developer's toolbox. It's not
00:07:38.440
just for data scientists anymore. Those things are easy to use, as
00:07:44.639
you'll see later tonight. Then the tooling in Ruby is pretty good,
00:07:51.479
thanks to some great dedicated contributors like Andrew Kane.
00:07:58.039
Most of the tools already exist out there or have bindings to machine learning tools
00:08:04.360
that are used in the industry. Then why did we decide to build this
00:08:12.639
in-house instead of buying it? Because there are tons of startups doing AI scribes today. It turns out
00:08:20.199
that the LLM is what really does most of the heavy
00:08:25.400
lifting. Building it really provides a superior experience to the
00:08:31.879
doctor by being able to integrate it into our EMR. And then we also have the data,
00:08:39.360
and we have the feedback loop with doctors, so that's where it really helps to
00:08:45.839
build a better product. And then we also wanted to develop some of the expertise in-house, because we think
00:08:52.480
LLMs can be applied to a lot of other problems in primary
00:08:58.200
care. So with that context, here's our road
00:09:03.920
map for tonight. I'm going to take you a little bit on the journey we've
00:09:09.399
been on and go through some of the challenges we've encountered as
00:09:14.519
we've done this, and hopefully share some insights that can be useful and
00:09:20.000
inspire you with some of your other projects that would use LLMs. So the first step is
00:09:28.360
capturing a transcript. You have a patient-doctor conversation. In our case,
00:09:36.279
most of the time, 97% of the time, it happens in a video call between the
00:09:42.120
provider and the patient, and the starting point is that we need a
00:09:47.360
transcript of that conversation.
00:09:52.440
There are a lot of ways you can accomplish that, with different
00:09:58.240
pros and cons, so I'm going to just quickly go through each of
00:10:03.360
the options with what we found, with some of the pros and cons. The first option is just to integrate with
00:10:09.839
your video call vendor's API. Tons of vendors already have
00:10:15.680
transcripts built in, like Zoom, Google Meet, we almost started earlier, Daily.co is a
00:10:21.760
big one as well. And it's very easy to get a transcript from there:
00:10:27.560
typically you just change a setting and you use their API to access it. Getting it from Zoom, in our case, is
00:10:34.959
four lines of code, essentially, to get a transcript back at the end of
00:10:40.320
the visit. So it's really easy to integrate, and the experience is seamless for everyone because it can
00:10:46.720
already capture the audio. The con is that the quality of the transcript can vary depending on
00:10:54.560
the vendor you use. You don't have a lot of options, you just take what they have. And then the ability to
00:11:00.800
have a real-time transcript can also vary, so you might only get the
00:11:06.279
transcript at the end of the call. Another option is that you have
00:11:12.880
a bot that joins the video call. If you go on sales calls you may have seen that all sales teams now have
00:11:19.120
some kind of note-taking bot. That's nice because it works across a
00:11:26.440
lot of different vendors, and you can typically get a real-time transcript. If you
00:11:31.720
are in a patient-provider call, it's a bit odd to have a third person,
00:11:36.800
which is a scribe bot, in the stream. It's not the end of the world, but it's a little odd. If you want to
00:11:43.720
build it yourself it's pretty involved, but there are third parties
00:11:48.880
that you can use. Recall.ai is an example that has a very
00:11:53.959
simple API to do this. That's something we couldn't use in
00:11:59.639
healthcare because they don't have the compliance, they're still a small startup. Then another route is
00:12:07.720
building a desktop app, which is obviously a pretty involved
00:12:13.959
project. It's nice because it works with all the video apps, and it works with the desktop client too; today by
00:12:21.120
default our doctors use the Zoom client. It's a seamless experience for the
00:12:27.160
patients. But obviously developing and maintaining a desktop app
00:12:34.160
is a lot of work, especially when you need to support multiple
00:12:40.120
OSes. And then finally, a web app is an
00:12:45.320
option, or a Chrome extension, which kind of goes in the same category here. It is
00:12:52.639
possible and fairly easy to build. There are tons of third-party vendors that
00:12:58.240
do transcription today, so you have a very wide choice of options. There are even
00:13:03.839
medical-specific ones, and some for other industries as well, so you
00:13:09.639
can get really good transcription, and it's fairly easy to integrate.
00:13:17.639
It works only when the video call is in the browser, which is a
00:13:23.360
limitation, but not the end of the world. And then the UX is also a little clunky when you
00:13:30.680
want to capture both the audio of the video call and the audio of the person
00:13:36.360
talking. So they all have trade-offs. On
00:13:42.279
our end we decided to go with getting the transcript directly from our vendor, Zoom,
00:13:48.160
and we're probably going to go towards a web-based solution eventually to be
00:13:53.480
able to get better transcriptions. So what are the common issues with transcripts? Omissions:
00:14:01.279
sometimes it misses things. Names can be challenging. For us, the names of
00:14:07.600
medications: if you don't have a medical-based transcriber, it just doesn't
00:14:13.480
get them. Medical conditions and pharmacy names can be challenging. Dates and times are also
00:14:22.399
sometimes a little tricky. Originally we spent some time looking at transcripts,
00:14:29.800
and then we asked ourselves how accurate the transcription actually needs to be, and it turned out that it was
00:14:37.440
a lot less than we thought. Even a medication name that is not picked
00:14:43.680
up, that is just written as a similar-sounding word but not the actual correct name, the LLM would correct
00:14:51.519
once we sent it to the LLM. So it corrected a lot of errors based
00:14:58.199
on the
00:15:03.519
context. So that was for transcription. Our second step is
00:15:09.800
the de-identification of the transcript. So why de-identify the
00:15:15.800
transcript? I mean, some of it is a little obvious: privacy and compliance
00:15:21.600
in healthcare. Some people are, for some reason, a little spooked that we would
00:15:27.720
send sensitive information, like the conversation between the doctor and
00:15:33.880
themselves, somewhere else. Also, it's important to mention that
00:15:40.240
de-identifying is not a replacement for having an AI stack that is secure and compliant;
00:15:46.600
de-identification is not perfect. And then if you're starting
00:15:52.120
to fine-tune the model and train the model, then it's even more critical, because some of the data you feed to the
00:15:58.199
model could be exposed, or you may become vulnerable to some
00:16:05.160
attacks. So for de-identification solutions, there are
00:16:13.000
tons of options out there. For some, it's built into the video call solution;
00:16:18.560
Amazon Transcribe is an example, AssemblyAI as well, where you can just ask with
00:16:25.480
an option to get a de-identified transcript, so that's very easy.
00:16:31.839
There are also dedicated de-identification services.
00:16:38.279
From what I hear, the leading solution is Private AI, which does just that with very high accuracy, and you're
00:16:46.399
able to configure in detail what you want to
00:16:51.560
de-identify. And then there's obviously a bunch of open source solutions if you prefer to do it yourself. Yeah?
00:16:58.920
I'm curious, what is de-identification exactly? Is it simply eliminating
00:17:04.520
words that can identify the person recorded, or is it something else? Yeah,
00:17:10.600
I have an example here. So obviously all the demographics and things like
00:17:16.439
that. And then you could also, like, we're not de-identifying medical
00:17:21.839
conditions, but you could do that also, you could just say remove all medical conditions. So that's an example from
00:17:31.080
Private AI; you can go online and try their product,
00:17:37.480
so you can actually configure what you want to de-identify. And in healthcare in the US there's
00:17:44.799
a very clear guideline as part of the privacy rules: there are, I think it's
00:17:52.480
18 elements, pieces of information, that you need to remove in order for it to be considered
00:18:02.720
de-identified. So it's basically, from what we're seeing here, the name, the gender and the age, the location? Yeah, you
00:18:10.679
could have phone numbers, credit cards, there's a bunch of other things. Could this affect the recommendations,
00:18:18.400
you know, for example age and gender, those could be things that are
00:18:25.840
important? It could, yeah, it could.
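To make the de-identification step concrete, here is a minimal rule-based sketch in Ruby. It is not how any of the vendors mentioned in the talk work: the `deidentify` method, the `PATTERNS` table and the placeholder tags are all hypothetical, and a handful of regular expressions covers only a small part of what needs to be removed.

```ruby
# Hypothetical rule-based de-identifier: replaces a few identifier types with
# placeholder tags. The patterns and tag names are illustrative assumptions.
PATTERNS = {
  "[EMAIL]" => /\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b/,
  "[PHONE]" => /\b\d{3}[\s.-]?\d{3}[\s.-]?\d{4}\b/,
  "[DATE]"  => %r{\b\d{1,2}/\d{1,2}/\d{2,4}\b}
}.freeze

def deidentify(transcript, known_names: [])
  redacted = transcript.dup
  # Names already known from the chart (patient, provider) are replaced
  # first, case-insensitively.
  known_names.each do |name|
    redacted.gsub!(/\b#{Regexp.escape(name)}\b/i, "[NAME]")
  end
  # Then each pattern is replaced by its tag, in the order listed above.
  PATTERNS.each { |tag, pattern| redacted.gsub!(pattern, tag) }
  redacted
end

text = "Hi John Smith, call me at 415-555-0132 or john@example.com. Born 03/14/1990."
puts deidentify(text, known_names: ["John Smith"])
```

Passing names that are already in the chart avoids having to guess them from the text; anything the patterns miss stays in the transcript, which is one reason de-identification does not replace a secure, compliant stack.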
00:18:35.000
So that was de-identification. The third step is actually generating the
00:18:41.120
clinical note, or writing the clinical
00:18:47.200
documentation. I'm going to take you through a few iterations. In practice we did a lot
00:18:53.520
more iterations and it's a bit more involved than this, but I'll try to keep it
00:18:59.240
simple. The baseline was just asking GPT-4, can you generate a clinical
00:19:06.720
note using this transcript, a very, very simple prompt, as you
00:19:12.400
can see. And we ask it to use the SOAP format. It's not the same
00:19:17.640
format as you guys would expect; it's the format that doctors use, the one that
00:19:25.240
is used the most to document visits. I have an example here. It stands for subjective, objective,
00:19:32.200
assessment and plan, so it's a standard way to document a visit for
00:19:40.039
doctors. Yeah, so it's not a SOAP API, exactly. It's not a SOAP XML
00:19:48.360
API. So yeah, here's an example:
00:19:54.000
a simple prompt, just the transcript that we captured, and it will
00:20:01.360
generate a fairly reasonable clinical note in most
00:20:08.600
cases, which with five minutes of work is pretty
00:20:16.039
impressive. I mean, it's
00:20:21.880
a good baseline, but there are a bunch of issues as we dig a little bit more.
00:20:27.720
The first one is hallucination: it makes stuff up, which is not
00:20:33.640
really good in
00:20:39.960
healthcare.
00:20:46.159
Then the output can be very inconsistent: the way it will
00:20:52.520
write the plan, the tone, can be very inconsistent. And
00:21:00.919
then we don't really know what best practices it uses, because we don't really know what it was trained on, but it doesn't use our clinical best practices. We
00:21:07.400
have a clinical team that spends a lot of time determining, for this condition, these are the things we
00:21:14.080
should do, this is what we should tell the patient, and out of the box it obviously doesn't do
00:21:21.400
that. So one example of hallucination, which is more of a funny example:
00:21:28.679
we had generated a note from a conversation, and it
00:21:35.360
was saying that the patient is a 30-year-old male, right in the first sentence of
00:21:40.799
the subjective part, but nowhere in the transcript was it talking about the
00:21:47.279
age of the patient or the gender. So we just asked it, how do you know that the patient's
00:21:54.279
age is 30 years old? And then it responds: I apologize for the confusion. The patient's age is not mentioned in the
00:22:00.440
transcript. The age of 30 years old was an assumption I made in the previous
00:22:06.000
responses, which was incorrect. In a real-life scenario the patient's age would be an important
00:22:12.000
piece of information to gather during the initial history taking. So, I
00:22:18.559
mean, it's funny, it's kind of obvious, but it's a real
00:22:26.159
problem. So what are some of the things we can do to reduce
00:22:31.200
hallucination? There's more, but I'll keep it to the obvious ones. First, be explicit about not making stuff up.
00:22:39.799
And then we found that giving more context, especially the
00:22:46.679
obvious context, was also really helping. I mean, that seems kind of
00:22:52.960
obvious, but here we just started the prompt with: you're a very accurate
00:22:58.000
professional medical scribe. And then we tell it to not make any
00:23:03.760
recommendations that were not explicitly mentioned in the transcript, and that solves a lot of those issues like the one we just
00:23:10.520
saw. And then the second thing is to give it
00:23:15.880
additional context. I mean, there are a lot of things we already know from the medical
00:23:22.200
chart, and instead of having GPT trying to guess them or make deductions
00:23:28.200
from the transcript, which sometimes is not complete, we thought, why don't we just give it the context? An
00:23:36.760
example would be saying that it's a 26-year-old male, here for what type of
00:23:41.799
visit: an ADHD follow-up. That gives more context, and then we can go
00:23:47.520
further and also include things like the patient's problem list, the
00:23:52.760
patient's active medication list, and so on. And that really helps the
00:23:58.760
LLM make fewer deductions and fewer
00:24:04.520
errors. Issue number two:
00:24:10.080
inconsistencies. That's an example of two versions of the same plan
00:24:16.080
written differently, and the LLM might return either of those. One is
00:24:21.880
talking as an observer: the provider will send a refill of six tablets. While
00:24:28.279
the other one is from the perspective of the provider: I will send you a refill
00:24:33.320
of six tablets. And obviously when we have things that are generated that will
00:24:38.640
be sent to the patient, we want to be really consistent about how we address the patient and the
00:24:45.279
tone. And again, similarly, it's
00:24:52.200
about being really explicit, and it solved a lot of those issues: being
00:24:58.240
specific about telling it what kind of tone we want and how to address the
00:25:06.880
patient. Issue number three is a bit more tricky:
00:25:13.799
out of the box it doesn't use our clinical guidelines, and as I said we have a team of doctors who spend a lot of time
00:25:20.720
determining our clinical guidelines. So we came up with a very simple plan.
00:25:28.640
When we looked at what we would do, we actually have what we call macros,
00:25:35.360
but they're templates, essentially, that help the
00:25:40.960
doctors write what is needed in the plan. So doctors use
00:25:49.000
those macros and those templates to help them think about what needs to
00:25:54.919
be in the documentation, as well as to write it faster, because
00:26:00.960
some of the text is already filled out. And our clinical team spent a lot
00:26:08.200
of time putting those together. We have over 700 of them for all the
00:26:13.279
different medical conditions. So how can we have the
00:26:18.520
scribe use them? I know last month's talk was about the
00:26:24.559
RAG approach, where you create embeddings for your knowledge, put them in a vector database, and then you query it.
00:26:30.080
We tried that, but it wasn't really working well at all for this problem,
00:26:35.399
so we took a slightly different approach. At the end of the day, what you want is to give the AI the right
00:26:43.080
context. So what we did is, first, we created a dictionary of our
00:26:49.039
templates. We actually used the LLM to create a description of each template:
00:26:54.520
originally we only had the name and the full template, and that's not really useful, so we
00:27:00.320
generated descriptions automatically using the LLM. And then we use the LLM
00:27:08.720
to pick the appropriate template based on the conversation in the context of
00:27:13.760
the patient chart, and that did
00:27:21.080
pretty well. And then we expanded our little prompt here and
00:27:28.640
added the template as context when generating the clinical note, and that worked pretty
00:27:40.600
well. So then the fourth step:
00:27:46.039
evaluate and iterate. You saw that we went through some iterations
00:27:52.799
here, but there were a lot of different iterations, and every time you change something,
00:27:58.039
you don't know if you broke something else. So how do you
00:28:05.559
actually know that you're making progress? And then how do you run a
00:28:12.840
lot of experiments and iterate
00:28:20.279
quickly? Traditional machine learning has very hard metrics,
00:28:27.799
precision, recall, the area under the curve, and you can know if you're making
00:28:33.559
progress or not. With text that you generate it's a little more tricky to
00:28:40.559
know that you're actually making progress. There are at least three
00:28:45.760
strategies that we looked at to do that. The first one is
00:28:54.600
pretty obvious: human feedback, so building feedback into the feature.
00:29:03.440
We generate suggestions for the doctor, and the doctor actually reads and reviews them and can tell us if
00:29:10.840
it was good or bad and what was wrong with
00:29:16.080
it. So, yeah,
00:29:22.600
obviously human feedback is great, it's really great feedback,
00:29:28.399
but it has one big caveat: it's very slow to iterate, and it can be really costly as well. We did
00:29:36.279
it as part of real encounters, so it didn't take a lot more time, but then you
00:29:42.240
can't test 10 things at the same time or run a kind of regression testing on things that
00:29:49.080
you change, so it can be
00:29:54.799
challenging. Another option is to compare to the ground truth, the
00:30:02.000
ground truth meaning that you have a clinical note that you know is
00:30:08.640
good, and that you can compare what you're generating with the LLM against.
00:30:14.320
There's a bunch of different metrics that exist out there to do
00:30:20.559
exactly that, and basically they try to determine the similarity of
00:30:27.440
facts within the text.
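As a rough illustration of comparing a generated note against a ground-truth note, the sketch below computes a token-overlap F1 score in plain Ruby. This is a deliberately simple stand-in chosen for the example, not one of the metrics discussed in the talk; the method names `tokens` and `overlap_f1` are made up here, and word overlap is only a crude proxy for the similarity of facts.

```ruby
# Token-level F1 between a reference (ground truth) note and a candidate note.
# A simple stand-in for illustration; real metrics go well beyond word overlap.
def tokens(text)
  text.downcase.scan(/[a-z0-9]+/)
end

def overlap_f1(reference, candidate)
  ref = tokens(reference).tally
  cand = tokens(candidate).tally
  return 0.0 if ref.empty? || cand.empty?

  # Count each shared token at most as often as it appears in both texts.
  shared = cand.sum { |tok, n| [n, ref.fetch(tok, 0)].min }
  return 0.0 if shared.zero?

  precision = shared.to_f / cand.values.sum
  recall = shared.to_f / ref.values.sum
  2 * precision * recall / (precision + recall)
end

puts overlap_f1("send refill of six tablets", "refill six tablets today")
```

Because the score is a single number per note, it can be run over a fixed set of reference notes after every prompt change, which is the regression-testing use that human feedback makes hard.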
00:30:32.880
So here at the bottom there's a paper that actually looked at
00:30:39.279
all those different metrics and also at human experts, and they found that
00:30:46.039
BLEURT, the fourth one, was the one where there was the closest correlation
00:30:51.760
between the BLEURT score and actual doctors reviewing the
00:30:59.600
notes. What BLEURT does is basically take as
00:31:05.440
input a reference, so the ground truth, and a candidate, so the thing you're
00:31:11.039
testing, and then it returns a score that indicates how
00:31:19.919
much the two mean the same thing, basically. And it's a trained metric, so it's actually a machine learning
00:31:26.159
model that has been trained to do that. And then there's an example at
00:31:33.559
the bottom of something that means about the same thing but written in different ways, and it tries to tell
00:31:39.840
you how close they are to each
00:31:46.799
other. Finally, the third strategy is creating a grading criteria prompt, so
00:31:55.559
using the LLM to grade your results. Similarly to how
00:32:02.399
we design a prompt to generate a note with specific instructions, you can kind of do the same thing with an
00:32:08.919
evaluation criteria, so coming
00:32:14.000
up with a score, like a 1-to-20 score, and then 5% is for getting the tone of
00:32:21.159
the plan right, or are you talking in the second person to the patient. And then you are
00:32:28.240
looking at a hard number and can see if you majorly broke
00:32:34.480
something or not.
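The grading-criteria idea can be sketched as a weighted rubric in Ruby. Everything here is an assumption made for illustration: the `RUBRIC` contents, the method names, and the 0-or-1 grading scheme. Only the 5% weight for tone comes from the talk. The call to the LLM is left out; `weighted_score` takes the per-criterion grades as if they had already been parsed from the model's reply.

```ruby
# Hypothetical rubric: weights are percentages and sum to 100. Only the 5%
# weight for tone comes from the talk; the other criteria are made up.
RUBRIC = {
  "tone" => { weight: 5, question: "Does the plan address the patient in the second person?" },
  "faithfulness" => { weight: 60, question: "Is every statement supported by the transcript?" },
  "format" => { weight: 35, question: "Does the note follow the SOAP format?" }
}.freeze

# Builds the grading prompt that would be sent to the LLM along with the note.
def grading_prompt(note)
  criteria = RUBRIC.map { |name, c| "- #{name}: #{c[:question]} Answer 0 or 1." }
  "Grade the clinical note below against each criterion.\n#{criteria.join("\n")}\n\nNote:\n#{note}"
end

# Combines per-criterion grades (between 0 and 1) into a single 0-100 score.
# A missing criterion counts as 0.
def weighted_score(grades)
  RUBRIC.sum { |name, c| c[:weight] * grades.fetch(name, 0.0) }
end

puts grading_prompt("S: ... O: ... A: ... P: I will send you a refill of six tablets.")
puts weighted_score("tone" => 0, "faithfulness" => 1, "format" => 1)
```

Keeping the weights in one table makes the hard number comparable across prompt iterations, so a change that breaks something shows up as a drop in the score.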
00:32:40.679
Cool. And finally, the last
00:32:47.279
point on our journey: I just wanted to talk a little bit about tooling, especially for
00:32:54.559
Ruby. So, useful tools. First,
00:33:00.200
Jupyter Notebook, which is used by most data scientists across the industry and
00:33:06.159
for a lot of machine learning. You can actually use Ruby in Jupyter
00:33:11.240
Notebook, which is something that might not be obvious. There's IRuby, which is a Jupyter Notebook kernel that allows
00:33:18.360
you to run Ruby. And then you can also,
00:33:23.600
obviously, load your Rails app into a Jupyter notebook, so you could
00:33:30.360
pull your data using Active Record right in your notebook and then do
00:33:35.639
analysis inside your notebook just using Active
00:33:41.519
Record. The second tool is data frames. For people who are familiar
00:33:48.240
with Python and data science, there's pandas, which is really
00:33:55.200
popular, and there's kind of an equivalent in Ruby that is pretty good, called
00:34:01.480
Daru. And then finally the third tool is more of a resource. Obviously
00:34:08.919
to do this, prompt engineering is really important, and there are a lot of different prompt engineering tricks.
00:34:15.399
I mean, once you know them it's fairly straightforward, but
00:34:21.000
promptingguide.ai is a really great guide that documents all those
00:34:27.040
different tricks, so if you're going down the route of LLMs I would definitely recommend looking at
00:34:35.440
it. Cool, that's it, thank
00:34:46.440
you. And finally I'm going to do my shameless salesman plug. I hope no one here is in car sales
00:34:54.720
and offended by my... But yes, we're hiring
00:35:01.839
aggressively in 2024. We're planning to double the size of our engineering team, for all sorts of
00:35:11.119
positions. Some may or may not be on our website yet, so feel free to go on our
00:35:16.520
website or reach out to me directly: JS at Circle
00:35:25.800
Medical. Of course,
00:35:33.760
questions. Yeah? It's a lot, I'll make it short: is it HIPAA compliant,
00:35:39.480
having like an AI SK