OMOP Cup Discussions
Welcome to the OMOP Cup discussion page. Registered users will be able to post public questions and comments here. Although we encourage questions be posted here, feel free to email omopcup@gmail.com as well.
-
Hi, is this contest open to people outside US? Say New Zealand :-)
-
Quan: While we would like to make the competition open to as many people as possible, there are some legal limitations on the issuance of prizes. Please read the rules document for more information. You may also participate in teams, which can resolve some of the geographic limitations.
-
Eric –
Mike Linacre <orwik@winsteps.com> was part of the team that placed second in the Netflix Prize. He wants to know how to download the OMOP datasets. What is the link to download them?
Could you please send me this information? Borya has asked me to pass the info on to him and to invite him to both challenges. Thanks a lot.
Sharmila
-
Thank you, Sharmila. I have downloaded the files. They are linked from the Challenges tab: open Challenge 1, then scroll down. Same for Challenge 2.
Question about: OMOP_TRUE_RELATIONSHIPS.txt in Challenge 1.
My download has 4000 1’s, but only 3920 0’s. Is this correct? The PDF says 4,000 0’s.
I have downloaded twice. Same both times …
My last line is 7921, containing: 898 2759 0
-
I had our engineer look up the file. It appears to be stored on Amazon, and that copy also has 3920 0’s. Maybe something is up with the file, since both Orwik and Amazon have the same number of 0’s. Can you send me the file directly so I can try to figure out the problem?
-
Here is what Michael (the engineer helping me) sees:
grep -e '0$' OMOP_TRUE_RELATIONSHIPS.txt | wc -l # => 3920
-
Mike: That is the correct number of lines. There is a comment in the FAQ discussion about this. The original file accidentally contained 80 lines with a condition id larger than 4519. These were removed from the file. I apologize for not updating the PDF accordingly.
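For anyone without grep, the same check can be sketched in Python. This is only a minimal sketch: it assumes each row of OMOP_TRUE_RELATIONSHIPS.txt is whitespace-separated with the 0/1 label in the last field (as in the row "898 2759 0" quoted above), and the function name count_labels is made up for illustration.

```python
from collections import Counter

def count_labels(lines):
    """Count rows by their last whitespace-separated field (assumed to be the 0/1 label)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        counts[fields[-1]] += 1
    return counts

# Tiny in-memory example. For the real file, pass an open file handle instead:
#   with open("OMOP_TRUE_RELATIONSHIPS.txt") as fh:
#       print(count_labels(fh))
sample = ["898 2759 0", "12 34 1", "", "56 78 0"]
counts = count_labels(sample)
print(counts["0"], counts["1"])
```

If the file has a header row, its last field will show up as an extra key in the result rather than being counted as a 0 or a 1.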
-
Question about inter-drug interactions:
Often, in the real world, certain adverse conditions are observed when a combination of drugs is taken simultaneously. For example, condition C occurs only if drugs A and B are taken simultaneously. Does the OMOP Cup data model this phenomenon?
If yes, how do we print the results for the above situation? Do we print out two tuples: <A, C> and <B, C>?
Or should inter-drug interactions be ignored?
Any help appreciated :)
-
Anand: No, this is not directly modeled in this version of the simulator. The focus is on drug side effects, not drug-drug interactions. There can certainly be times where the adverse event is more prevalent in the presence of a second drug, but not where two drugs are required for the event to occur. In other words, I wouldn't go as far as to say they should be ignored, but they should not be reported. The problem of drug-drug interactions is an important one, which didn't make it into this simulator for technical reasons but is planned for future simulators. Considering all interactions makes the problem even more challenging, as there are now D * D * C possible relationships (113 billion for the present simulation). You can read more about the simulator at http://omop.fnih.org/osim.
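The size of that search space is easy to check with a few lines of Python. This is a small sketch under stated assumptions: C = 4519 conditions is taken from the FAQ remark earlier in this thread about condition ids, while D = 5000 drugs is an assumption, chosen only because it reproduces the quoted figure of roughly 113 billion.

```python
D = 5000  # number of drugs (assumed; not stated in this thread)
C = 4519  # number of conditions (largest valid condition id per the FAQ remark)

pairwise = D * C               # single drug -> condition relationships
with_interactions = D * D * C  # drug x drug x condition relationships

print(f"{pairwise:,}")           # 22,595,000
print(f"{with_interactions:,}")  # 112,975,000,000
```

Under these assumptions, allowing pairwise interactions multiplies the number of candidate relationships by a factor of D.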
-
Will it be possible to share the W_00, W_01, W_10, W_11 numbers for the benchmark BCPNN-M method? You can restrict the file to the drug-condition pairs present in the ground truth file. This will go a long way to help me (and maybe others struggling to fix software bugs).
-
Robin: The BCPNN benchmarks were run from the Disproportionality Analysis package available at http://omop.fnih.org/methodslibrary, which you are free to compare to your method. Another factor in some of the benchmarks that may not have been clearly mentioned is the use of drug and condition eras. Eras are constructed from overlapping drug exposures or condition occurrences. For example, a patient who received a 31-day dose of a drug on the first of every month for a year would have a single drug era one year long. Depending on the method, this may or may not matter, but it can make a substantial difference in disproportionality methods. More details on eras can be found in the CDM specifications at http://omop.fnih.org/CDMandTerminologies.
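The era idea can be illustrated with a short interval-merging sketch in Python. This is not the CDM's actual era-building procedure, which is defined in the specifications linked above. It simply merges exposures that overlap or abut. The function name build_eras and the gap_days parameter are hypothetical, and end dates are assumed to be inclusive.

```python
from datetime import date, timedelta

def build_eras(exposures, gap_days=0):
    """Merge (start, end) exposure intervals into eras.

    Intervals that overlap, or that are separated by at most `gap_days`
    uncovered days, are collapsed into one era. `end` is inclusive.
    """
    eras = []
    for start, end in sorted(exposures):
        # Extend the current era if this exposure starts soon enough after it ends.
        if eras and start <= eras[-1][1] + timedelta(days=gap_days + 1):
            eras[-1][1] = max(eras[-1][1], end)
        else:
            eras.append([start, end])
    return [tuple(era) for era in eras]

# A 31-day dose dispensed on the first of every month for one year.
exposures = [
    (date(2009, month, 1), date(2009, month, 1) + timedelta(days=30))
    for month in range(1, 13)
]
eras = build_eras(exposures)
print(eras)
```

With these twelve exposures the sketch produces a single era running from 1 January to 31 December 2009, which matches the one-year era described above.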
-
Just curious... has anybody else here worked on any other competitions recently (Netflix, AusDM, etc.)?
ER
-
I was a member of the second-place team in the Netflix challenge ("The Ensemble"), along with Mike Linacre (see his earlier posts here). I haven't competed in any other contests besides OMOP. Unfortunately, I didn't find out about OMOP until earlier this month.
Jeff Howbert
-
Jeff: Good to see you here.
It appears that the winner will be the person who best "reverse-processes" the data-simulator. It also looks like the speculation will come true that "no one will beat 0.33" (see FAQ).
Mike L.
-
Jeff, Mike:
I thought your names looked familiar. I also did Netflix, though way down the leaderboard. Looks like data-mining is a small world :)
Mike - Not sure that reverse-engineering is the key here - more like disentangling confounding drug effects to suppress false positives. That would explain why a regression-based approach (BLR) seems to do pretty well - and I suspect that a lot of the players in that narrow band above the BLR benchmark are using some version of it.
I have been experimenting with modifications to the proportional-reporting methods - they are pretty simple to implement even if you don't have SAS (I am using VB.NET) and run quickly once you get the data into memory from disk. Since I don't seem to be in any danger of winning at this point, here is what I have been doing.
1) Pre-process data so that I can access patients one at a time and each patient has associated symptoms and drug prescriptions.
2) Run through a phase 'A' where you count the number of patient days, drug treatments, symptom counts, and the count of where symptoms occur simultaneously with drugs (coincidences). I am not using any kind of extended windows on the drugs.
3) Determine a 'relative likelihood' that a drug causes a symptom, based on the actual coincidences (Na) vs expected (Ne). L = (Na+0.5)/(Ne+0.5). The '0.5' factor is there so single events of low expected occurrence don't become excessively dominant.
4) Run a phase 'B' where you run through each patient day (about 6 billion total) and tally symptom-drug coincidences. If a symptom associates with more than one drug, you can either tally it to the drug with the highest 'L' for that symptom (winner-take-all), or divide the count proportionally according to 'L'. I have found that the two approaches, and a fair number of similar ad-hoc weighting rules, yield substantially similar results.
5) Now that you have tallies, calculate actual (Pa) and expected (Pe) rates (probabilities) of symptoms. N = total # of days a drug was prescribed in whole dataset. Pa = # of drug-symptom coincidences/N. Pe = # of symptoms (over whole dataset)/# of patient days (whole dataset).
6) Now instead of just dividing Pa/Pe for a proportional metric, I calculated a 'pseudo-Z' metric (loosely) based on chi^2 for one degree of freedom. Z = (Pa-Pe)*sqrt(N/Pe). This can be roughly interpreted as the 'significance' of the relationship between the drug and the symptom, with positive values indicating a positive relation (drug causes symptom), negative values a negative relation (drug prevents symptom), and zero no relation. Because values for this 'Z' range from ~ -20 to ~+4000 on the 'true relations dataset', the extremes seem pretty meaningless in terms of traditional normal-distribution-Z. The metric does seem to be a useful way of ranking the relations, however, in that it can highlight a weak relation occurring in many cases over a strong one occurring in a very few cases - something the proportional methods don't seem set up to do.
ER
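The formulas in steps 3 to 6 above can be sketched in Python as follows. This is a minimal illustration of the arithmetic only, not ER's VB.NET code. The function names and the example numbers are made up, and the expected coincidence count Ne is taken as an input because the post does not say how it was computed.

```python
import math

def relative_likelihood(n_actual, n_expected):
    """Step 3: L = (Na + 0.5) / (Ne + 0.5)."""
    return (n_actual + 0.5) / (n_expected + 0.5)

def split_coincidence(likelihoods, winner_take_all=False):
    """Step 4: share one symptom event among the drugs present that day.

    `likelihoods` maps drug -> L for this symptom. Returns drug -> fraction
    of the event credited to that drug (the fractions sum to 1).
    """
    if winner_take_all:
        best = max(likelihoods, key=likelihoods.get)
        return {drug: float(drug == best) for drug in likelihoods}
    total = sum(likelihoods.values())
    return {drug: lik / total for drug, lik in likelihoods.items()}

def pseudo_z(coincidences, drug_days, symptom_count, patient_days):
    """Steps 5-6: Z = (Pa - Pe) * sqrt(N / Pe), with N = drug_days."""
    p_actual = coincidences / drug_days        # Pa
    p_expected = symptom_count / patient_days  # Pe
    return (p_actual - p_expected) * math.sqrt(drug_days / p_expected)

# Made-up numbers, purely for illustration.
ratio = relative_likelihood(n_actual=1, n_expected=0.01)
shares = split_coincidence({"drug_A": 3.0, "drug_B": 1.0})
z = pseudo_z(coincidences=30, drug_days=10_000,
             symptom_count=1_000, patient_days=1_000_000)
print(round(ratio, 2), shares, round(z, 2))
```

In the made-up example the single event with Ne = 0.01 gets L of about 2.94 rather than 100, which shows the damping effect of the 0.5 terms described in step 3.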
-
Mike: While we expect people to look at how the data was generated, we hope that reversing the simulation process is not the most effective method for approaching the competition. As Ed mentioned, the success of the BLR and BCPNN benchmarks indicates that doing so is not necessary, as neither one uses any information about the simulator. Taking into account information about the domain, however, like the idea that the adverse event should come after the drug, is obviously important to include in the formulation of any method specific to this task.
The fact that we released the simulator to the public during the competition demonstrates our belief that simulator-specific solutions would not be the most effective. While there are many limitations of the current simulator, we believe it is the best one available for evaluating methods when "ground truth" is not available in real-world EMRs. Going forward, we expect the methods that do well here to also do well in more complex simulated data and in real patient data with known drug adverse events.
-
Thank you for your response, Eric.
Please don't take my comment as a negative. If the simulator matches the real world (as we hope), then reverse-engineering the simulator is also reverse-engineering the real world, something the physicists are trying to do right now at the Large Hadron Collider. So the scientific process becomes exactly what we are doing:
1. Construct a theory about the real world
2. Simulate data to match generating parameter values which accord with the theory
3. Construct a method which recovers the generating parameter values (the OMOP Cup)
4. Apply the method to real world data to obtain estimated parameter values
5. Compare the real world data with the data obtained by using the estimated parameter values as generators.
6. Use the results of the comparison to adjust the theory about the real world.
7. Loop back to 2.
-
Thanks Mike: That's a good way to think about it. We are hoping that the OMOP Cup serves as a catalyst for getting bright minds to think hard about the significant public health challenge of identifying drug safety issues in observational healthcare data. We hope the OMOP Cup uncovers some good ideas that we haven't previously considered, so we can apply those approaches to real-world data sources. Just as important, we hope this process helps us identify people interested in engaging in our research community to participate in solving this problem. To that end, we hope the nearing conclusion of the OMOP Cup serves as a beginning, and not the end, of a productive collaboration with those interested in partnering with us.
-
I don't have access to SAS. Will it be possible for someone to post the 2-dimensional contingency tables used by the BCPNN-M benchmark method? It is ok to restrict them to the drug-condition pairs in the ground truth data.
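For readers who want to build such tables themselves without SAS, here is a minimal Python sketch of a 2x2 count for one drug-condition pair. It does not reproduce the benchmark's numbers: the unit of counting (records here, but possibly eras or patients in the benchmark) and the mapping of cells to the W_00, W_01, W_10, W_11 names mentioned earlier are assumptions, and the function name contingency_2x2 is made up.

```python
def contingency_2x2(records, drug, condition):
    """Count records by presence/absence of `drug` and `condition`.

    `records` is an iterable of (drugs, conditions) pairs of sets.
    Returns a dict keyed by (drug_present, condition_present) as 0/1 flags.
    """
    table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for drugs, conditions in records:
        table[(int(drug in drugs), int(condition in conditions))] += 1
    return table

# Toy records, purely for illustration.
records = [
    ({"A"}, {"C"}),      # drug A with condition C
    ({"A", "B"}, set()), # drug A without condition C
    ({"B"}, {"C"}),      # condition C without drug A
    (set(), set()),      # neither
]
table = contingency_2x2(records, "A", "C")
print(table)
```

The four counts always sum to the number of records, which is a quick sanity check when debugging a larger implementation.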
-
...and the winner is?.... :)
-
Ed: That'll have to wait a few days. The leaderboard is officially unofficial until we verify the results, which will happen soon. I really want to thank everyone for participating, and for all your hard work. I also hope the competition has raised interest in the types of problems that OMOP is looking at and we can get more great minds working on them.
-
Hi Eric,
THANK YOU and the other organizers at OMOP/Orwik for hosting this competition. Although I can't speak for the other contestants, I found it to be a lot of fun, learned a few new things in the process, and got a chance to polish my skills on a massive dataset. I am definitely looking forward to an OMOP-2 sometime in the near future!
ER
-
May I second Ed on that one! Thanks for getting this organized and seeing it through. I actually used the competition as the practical part of my data mining class this semester and, while we had some organizational issues of our own, this was a very good experience for all of us. Thanks again!
Christophe
-
Although we only heard about the competition in early March, we decided to participate and learned a lot. It was a great experience!
If at all possible, it would be really interesting to ask the participants who achieved the highest scores to give some insights on how they approached the problem. Eric: do you think that's possible?
My other question is whether we can use the data as a benchmark for research in other applications?
Thanks again for this exciting challenge!
Ansaf