FAQ
Answers to common questions.
-
Q: There are some relationships in the OMOP_TRUE_RELATIONSHIPS.txt file that occur rarely (if ever) in the 10 million patients. How are any methods supposed to detect this?
A: There will be some relationships in the data that are rare and/or weak enough that they will not show up even in a 10 million patient sample. Therefore it will likely not be possible for any method to achieve the highest possible score. However, we feel that this situation exists in real clinical data, and also gives us an indication beyond the OMOP Cup of the number of patients required to detect rare adverse events.
-
A related question: it says there are 5000 drugs and 4519 conditions. Does each of these occur at some point? I found only 4907 unique drug exposures and 4494 unique conditions. Is that right, or did I do something wrong in scanning the files?
-
Joran, there is no guarantee that all of the drugs and conditions are used in the 10 million patient, 10 year sample. Since the list of drugs and conditions is generated before the patients, some rare drugs and conditions will never appear.
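For anyone wanting to repeat that count, here is a minimal sketch. It assumes whitespace-separated records laid out like the drug records quoted later in this thread, with the concept ID in the last column; the third sample record and the helper name `unique_concepts` are made up for illustration.

```python
def unique_concepts(lines, column):
    """Return the set of distinct values found in one whitespace-separated column."""
    seen = set()
    for line in lines:
        fields = line.split()
        if len(fields) > column:
            seen.add(fields[column])
    return seen

# First two records are quoted in this thread; the third is hypothetical.
# Assumed layout: exposure id, start date, end date, person id, drug concept id.
sample = [
    "5185300001 2005-07-28 2005-12-24 51853 2",
    "5185300002 2005-07-15 2006-04-26 51853 2",
    "5185300003 2006-01-03 2006-02-01 51853 17",
]
drugs_seen = unique_concepts(sample, column=4)
print(len(drugs_seen))
```

On the real files you would pass the open file object instead of `sample`; a count below 5000 is expected, for the reason given above.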
-
Since half of the associations in the truth table are true (1) and the other half are not, and for Challenge 1 we can only predict binary values, can we assume that exactly half of the Challenge 1 results are 1 and the other half are 0? Or does the distribution of the binary values in the Challenge 1 results not really affect the accuracy rate? Thanks.
-
Can I assume Challenge 2 Year 10 is an exact copy of Challenge 1 data? This will save me downloading 3.7GB.
-
Hans: The distribution of associations overall is not the same as in the OMOP_TRUE_RELATIONSHIPS.txt file. There are many more negative relationships (ones that should have a value of 0) than positive relationships, though I can’t give out the exact proportion.
However, you are not restricted to predicting only binary values. You could predict a probability that you think a relationship exists. Also, we are using mean average precision as our scoring criterion, which behaves a bit differently than accuracy. MAP is a rank-based statistic, and could be affected by changing the proportion of positive relationships in the data. However, that proportion is constant for the competition.
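To make "rank-based" concrete, here is a sketch of the textbook average precision for a single ranked list. This is not the competition's scoring program, whose exact formula may differ; the scores and labels are invented.

```python
def average_precision(scores, labels):
    """Textbook average precision: mean of precision@k over the ranks k of the true positives."""
    # Sort item indices from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

# Hypothetical predictions: 1 marks a true relationship, 0 a negative one.
labels = [1, 0, 1, 0, 0]
probabilities = [0.9, 0.8, 0.4, 0.3, 0.1]
ap = average_precision(probabilities, labels)
print(ap)
```

Only the order of the scores enters the calculation, which is why the statistic depends on where the positives land in the ranking rather than on the predicted values themselves.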
-
Robin: It is almost identical. The only difference is that in Challenge 2 Year 10 the Drug Exposure table is truncated so that prescriptions that would terminate in 2010 are changed to December 31, 2009. This was done for consistency with the other yearly data. You could make this change to the Challenge 1 data if you like. Depending on your model, this change may or may not make a difference.
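If you want to apply that change to the Challenge 1 data yourself, a minimal sketch follows. It assumes you have already parsed the start and end dates; the sample exposure and the helper name `truncate_end` are hypothetical.

```python
from datetime import date

CUTOFF = date(2009, 12, 31)

def truncate_end(start, end, cutoff=CUTOFF):
    """Clamp an exposure's end date to the cutoff; the start date is left alone."""
    return start, min(end, cutoff)

# Hypothetical exposure that runs into 2010.
exposure = (date(2009, 11, 20), date(2010, 2, 14))
truncated = truncate_end(*exposure)
print(truncated)
```

Exposures that already end on or before December 31, 2009 pass through unchanged.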
-
But when I report the results for Challenge 1, I have to report values as either 0 or 1. I cannot report values from 0 to 9 as in Challenge 2, or any real-valued probability, right? Since we use the rank-based MAP as the performance metric, should we just report the rank of each prediction? It is OK either way for us. Thanks.
-
Hans: I’m sorry if we were unclear. You can submit any real number as a predicted value in both Challenge 1 and Challenge 2. Reporting the rank of each prediction would be fine too, since that would be a special case. We had used binary predictions and 0-9 predictions in the examples for the sake of simplification, but it is not a limitation.
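A small sketch of why ranks are a special case: replacing scores by their ranks leaves the ordering of the pairs unchanged. It assumes a larger value means a stronger predicted signal; the probabilities are invented.

```python
def to_ranks(scores):
    """Replace each score by its rank (1 = lowest score), keeping the same ordering."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks

def ordering(values):
    """Indices sorted from highest to lowest value."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)

# Hypothetical predicted probabilities for four drug/condition pairs.
probabilities = [0.12, 0.87, 0.05, 0.40]
ranks = to_ranks(probabilities)
print(ranks, ordering(probabilities) == ordering(ranks))
```

Any rank-based statistic computed from `ranks` therefore matches the one computed from `probabilities`, as long as there are no tied scores.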
-
I’m assuming that the Benchmark submissions currently posted are from the contest organizers? If so, could we see a quick explanation of the difference between the random and PRR submissions for Challenge 1?
-
Joran: Yes, the benchmark submissions are submitted by the contest organizers. They would not be eligible to win the competition. OMOP is working on methods in parallel with the competition, and we will post some of those here. Random refers to a submission where every drug/condition pair is assigned a random value between 0 and 1. PRR stands for Proportional Reporting Ratio, a member of a class of methods called Disproportionality Analysis. I will be posting a few more benchmarks as well as more thorough explanations in the coming week, but you can read more about these methods at http://omop.fnih.org/methodslibrary
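For readers new to the term, here is a sketch of the standard PRR calculation from a 2x2 table of counts. How the benchmark derives those counts from the patient data is not described in this thread, so the counts below are hypothetical.

```python
def prr(a, b, c, d):
    """Proportional Reporting Ratio from a 2x2 table of counts.

    a: drug of interest, condition present
    b: drug of interest, condition absent
    c: all other drugs, condition present
    d: all other drugs, condition absent
    """
    if a + b == 0 or c == 0:
        # Ratio is undefined without exposures to the drug or without background cases.
        return float("nan")
    return (a / (a + b)) / (c / (c + d))

# Hypothetical counts: the condition follows 20% of exposures to the drug
# and 10% of exposures to every other drug.
value = prr(a=20, b=80, c=100, d=900)
print(value)
```

A value well above 1 suggests the condition is reported disproportionately often with that drug.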
-
The challenge description documents indicate that there are 4519 conditions, and in fact the maximum condition concept ID value appearing in the condition occurrences data is 4519; however, the true relationships file lists 54 additional condition concept ID values above 4519 (ranging from 4523 to 4618). Are we to ignore those entries in the true relationship file, or are results for condition concept IDs above 4519 to be reported also? For example, line 7870 in the true relationships file shows drug concept 681 paired with condition concept 4618 and no association present.
-
Steve: Thank you for discovering this issue with the true relationships file. There was a typo when that file was produced that allowed the condition id to range up to 4619 instead of 4519. This resulted in 80 drug/condition pairs (associations are absent for all of them) in that file where the condition id is greater than 4519. Please ignore these entries in the file. I’ve uploaded a new version without these 80 lines. There was a similar typo in the competition rules document that has also been fixed.
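If you are still working from the older copy of the file, the stray entries can be dropped with a filter along these lines. This is a sketch that assumes each row has been parsed into a (drug id, condition id, label) tuple; only the first row comes from the question above.

```python
MAX_CONDITION_ID = 4519

def drop_out_of_range(rows, max_condition_id=MAX_CONDITION_ID):
    """Keep only (drug, condition, label) rows whose condition ID is in range."""
    return [row for row in rows if row[1] <= max_condition_id]

rows = [
    (681, 4618, 0),  # the pair cited in the question, out of range
    (681, 4519, 0),  # hypothetical row at the upper bound, kept
    (12, 300, 1),    # hypothetical in-range row
]
cleaned = drop_out_of_range(rows)
print(len(rows) - len(cleaned))
```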
-
If we submit a set of scores as an entry how long before we can expect to see it on the leaderboard?
-
Timothy: Submissions should show up on the leaderboard within an hour of submission. In your case, I have a record of a submission last night, but there is no file attached. Could you please try resubmitting the file, perhaps using a different browser? If it doesn’t work again, please send me an email at omopcup@gmail.com with OS/Browser info and any error messages, and we’ll try to resolve the problem.
-
Here are two records from the Challenge 1 drug file.
5185300001 2005-07-28 2005-12-24 51853 2
5185300002 2005-07-15 2006-04-26 51853 2
Does this mean person ‘51853’ was given a double dose of drug ‘2’ between 2005-07-28 and 2005-12-24?
-
Robin: That is one way of viewing it. Specifically, it says that the patient had 2 overlapping exposure periods of the same drug. This is something that happens in real medical data, though some of the particulars might be specific to this simulated data. The simulated data also does not have ‘dose’ information per se (additional fields in the CDM which are not used), so it is up to you how to use this information.
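One option, if your method prefers a single continuous exposure, is to merge overlapping periods for the same person and drug. The sketch below does that for the two records quoted in the question; it is one possible treatment, not something the organizers prescribe.

```python
from datetime import date

def merge_overlaps(periods):
    """Merge overlapping (start, end) periods into continuous exposure eras."""
    merged = []
    for start, end in sorted(periods):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous era: extend its end date if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# The two exposure periods for person 51853 and drug 2 from the question.
periods = [
    (date(2005, 7, 28), date(2005, 12, 24)),
    (date(2005, 7, 15), date(2006, 4, 26)),
]
eras = merge_overlaps(periods)
print(eras)
```

Keeping the periods separate, or counting how many overlap on a given day, are equally valid choices depending on the model.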
-
Question about submission
I submitted an entry at 9:47pm EST on Nov 29. Why does it say the submission time as Nov 30 2:47am? I hope I have one more iteration before end of tomorrow.
Also do we need to remove old submissions when submitting new entries? (I can see my old uploaded files too when I submit a new entry)
-
Apparently the time is reported as Greenwich Mean Time.
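The five-hour gap is easy to check. In this sketch EST is modelled as a fixed UTC-5 offset, and the year is a placeholder since the post does not state it.

```python
from datetime import datetime, timedelta, timezone

# EST as a fixed UTC-5 offset (no daylight saving in late November).
EST = timezone(timedelta(hours=-5), "EST")

# The year is not given in the post; 2009 is only a placeholder assumption.
submitted_est = datetime(2009, 11, 29, 21, 47, tzinfo=EST)
submitted_gmt = submitted_est.astimezone(timezone.utc)
print(submitted_gmt.strftime("%b %d %H:%M"))
```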
-
I just discovered some lines of my submissions were incorrectly formatted, having an extra comma at the end of the line. Can you tell me whether, in scoring, the extra comma was ignored, or else whether the line was thrown out?
-
Robin: Yes, the website is running under GMT, but the time is converted to Eastern time when it is reported on the leaderboard.
And I would recommend removing old submissions from the upload list, to ensure the correct one gets scored.
Lisa: I’ll check on that.
-
Lisa: The extra commas were ignored, and the submission was scored as intended. The program ignores trailing empty “columns”, as would be indicated by extra commas. Most other types of misformatting are thrown out, but that one slipped through :)
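The behaviour described, ignoring empty columns at the end of a line, can be sketched like this. It is an illustration rather than the scoring program itself, and the drug, condition, score column layout is an assumption.

```python
def split_fields(line):
    """Split a CSV line and drop empty columns at the end of the line."""
    fields = line.strip().split(",")
    while fields and fields[-1].strip() == "":
        fields.pop()
    return fields

# Hypothetical submission lines, assumed to be drug id, condition id, score.
clean = split_fields("12,300,0.75")
trailing = split_fields("12,300,0.75,\n")
print(clean == trailing)
```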
-
Hi – I’m looking for clarification on how you define “a relationship exists” in the answer key. Does a relationship consist of at least one case of a drug-condition association, or is there some threshold that has to be crossed?
-
Rob: The data used in the competition were generated by a program that determined ahead of time which drugs and conditions were causally related, then probabilistically generated patient experiences based on these parameters. The goal of the competition is trying to reconstruct that information from the patient data alone. Thus our definition of a “true relationship” comes from information not available in the competition. While there is obviously a connection between cooccurrences in the data and the presence of a relationship, this is not a threshold in the data itself. For more information, check out omop.fnih.org/osim
-
Hi,
Unless I misunderstood the rules for Challenge 2, I believe there is possibly a design flaw in the Challenge. There is nothing that stops us from using year 10’s data to predict year 1-9 (and get a better score? yes.. back to the future :) ). If you don’t want this to happen then you should state this clearly in the rules, although I don’t know how you can enforce it during the competition…
Harris
-
Regarding Challenge #2 – I just did exactly that to see what would happen – I pulled the 1st 500 drugs out of a previous challenge #1 submission file and padded them out from years 1-10 and just got a huge increase in challenge #2 score vs a previous entry. I had previously just assumed they modulated the correlation over the different years to prevent someone from deliberately gaming the competition this way. Looks like they didn’t – OOPS!
I also re-submitted a challenge #1 datafile I originally sent in back in December and saw the challenge #1 score drop from 0.23177 down to 0.21590. Looks like something changed in the scoring system sometime in the past month or so. If this is true, then your older results may not be especially useful in benchmarking newer work.
TO THE ORGANIZERS: What changed and when?
ER
-
Harris: This was an issue we anticipated for Challenge 2. The only way to avoid this would have been to generate 10 additional independent 10 million patient histories, which was not computationally feasible to complete by the time the competition went live. This does make it difficult for us to detect if someone is just using the whole dataset for all timesteps (though scores that don’t change from one timestep to another are a red flag).
This problem is why we require potential winners to submit code to us so that we can verify that the score was produced by code that only utilized the correct subset of data. It is mentioned in the rules at the end of section 7: “For Challenge 2, Sponsor must be able to determine that the yearly results datasets were produced using only the appropriate yearly input file, without use of data from later time-points.” However, we were in error in not making that much more prominent. I will add it to the challenge 2 description file and the challenge 2 submission page.
We are aware that using challenge 1 results to game challenge 2 is possible, and that we can’t know for sure that a submission for challenge 2 is legal until the end of the competition. But we will verify that only the proper data was used before awarding any prizes. This uncertainty in the scoring is not ideal for us or the participants, but we decided it was the best option at the time.
I apologize for the confusion. If you made a submission that you think may not be eligible for the challenge 2 prize, you can send me an email at omopcup@gmail.com and I will remove it from the leaderboard.
Ed: To your question about the score changing, that does look worrisome. I will re-score all of the submissions and try to figure out what happened.
-
Ed:
Thanks for confirming this, it makes more sense now that Eric has clarified the rules.
Regarding the scoring change, I re-submitted one of my previous entries and it gave the same score…
Eric:
Thanks for the clarification, I did miss the paragraph at the end of section 7.
-
Eric: while we are at this, are we allowed to use OMOP_TRUE_RELATIONSHIPS.txt in producing our results?
-
I can confirm Ed’s finding: I resubmitted my first run, and the score went from 0.2355052 to 0.2192464. This is actually also more in line with my performance on the training set, where my first run gives the lowest MAP compared to my later runs (which scored lower on the test set)
-
Martijn, thanks for the confirmation – for a while I thought I may have inadvertently swapped some filenames. – ER
-
Regarding challenge 1: The instructions say that the true_relationships data won’t be scored if included in your submission. However, I submitted a csv version of true_relationships (otherwise unmodified) and got 0.0268632. I thought that maybe my submission was scored as an empty submission, however, when I submitted an empty submission I got about random for a score. Are the true_relationships rows scored or not?
-
Looks like the scoring oracle does consider the relations in the ‘true relations’ file when it scores. By rolling the true relations file into a previous entry I boosted the score from 0.21590 to 0.22775, an increase of about 5%!
Whether or not to consider these relations poses an interesting problem. These priors are a legitimate test for statistical techniques that don’t depend on training, but possibly not for NN-based techniques that use them as part of a training set. In either case, using the known relations to artificially boost one’s score contributes nothing to the development of better algorithms, and really isn’t in the spirit of the contest.
That being said, now that it is clear that the scoring oracle does base its score on the known relations, the public knowledge of that information makes the playing field a bit more level :)
ER
-
I figured that everyone should know, hence the post. Ed—you are welcome. If you win maybe you can give 5% of the prize money to me :)
-
Rob, Ed, and Martijn: Thanks for uncovering these issues. We’re working right now to fix both of these problems with the scoring system. Expect a more thorough resolution tomorrow.
Harris: You are allowed, though not required, to use OMOP_TRUE_RELATIONSHIPS.txt in your results for both challenges.
-
Ed, I think using known relations may not be totally useless to the algorithm development, they could be interpreted as known evidence from the biological/pharmaceutical domain, e.g. when a drug/condition has been proven to be associative/non-associative etc…
-
Harris, I may have been a bit unclear in trying to make my point. I would certainly agree with you that the known relations are valuable – whether for evaluating statistical techniques or as training sets for machine learning methods.
The problem arises when it becomes possible to use them to game the system. The rules implied that these points would not be considered in determining scores. The fact that they currently are considered makes it possible for a contestant to achieve a significant advantage not through better algorithms, but by taking advantage of this (formerly) not-generally known bug. If there were huge differences in scores near the top of this list, this might not be a big deal. From what I have seen of data-mining competitions, however, near the end it comes down to a few people slugging it out over fractions of a percentage point. This particular bug provides about a 5% increase – pretty enormous – think a 5 meter headstart in a 100 meter dash :)
The important point isn’t necessarily eliminating all the idiosyncrasies of the scoring system – it is sure to always have some – but in making sure everyone knows how they are being evaluated so everyone is really playing the same game.
ER
-
Ed, ah right, that I totally agree :)
so let’s wait for Eric’s team for a score “reset”..
-
Am having trouble with corrupted data files, could someone confirm these counts, or correct them:
Range of person_ids: 1- 10,000,000
Number of persons with drug exposures: 9,562,432
Number of drug exposures: 92,803,110
Number of condition occurrences: 316,686,137
Thanks .. :-)
-
Mike,
These are also the numbers I have.
Best of luck,
Christophe
-
Thank you, Christophe. Much appreciated.
Here is another oddity. Can you (or someone) confirm or correct it?
There are 7,921 OMOP TRUE RELATIONSHIPS. But, for 188 of these relationships, either the drug or the condition or both are not present in the DRUG and CONDITION data files.
This prompts the question: Is the “M = DC” in the scoring formula based on all possible D (5000) and all possible C (4519), or on the observed D (4775) and observed C (4494)?
-
Mike: There are indeed relationships given in the OMOP_TRUE_RELATIONSHIPS file that do not appear in the data. The relationships provided in that file were selected from all relationships present in the data generation process, regardless of their frequency in the resulting patient data. This means some relationships for rare drugs and/or conditions may never appear. The formula is calculated based on all 5000 drugs and 4519 conditions. We understand that this may make it impossible to achieve a perfect score on the challenge, but it keeps the scoring function from changing with the size of the database, which is important to other work OMOP is doing.
-
Thank you, Eric.
Then, since the divisor is 5000*4519, inspection of the Announcements so far suggests that there are 0.0159207 * 5000 * 4519 ≈ 360,000 drug-condition causations to be found.
-
A further speculation, based on inspection of the Announcements so far, suggests that only 120,000 of the drug-condition causations are observed in the datafiles. This suggests that the winning submission score will not exceed 0.33 unless the winner is very lucky or devises a method that surmises unobserved drug-condition causations.
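The arithmetic behind these two estimates, spelled out. The figures 0.0159207 and 120,000 are the poster's own readings of the Announcements and are not verified here.

```python
DRUGS = 5000
CONDITIONS = 4519

# Divisor used by the scoring formula: every possible drug/condition pair.
total_pairs = DRUGS * CONDITIONS

# Estimated number of true relationships, rounded to 360,000 in the post.
estimated_true = 0.0159207 * total_pairs

# Share of those relationships that would be observable in the data files.
observed_fraction = 120_000 / 360_000

print(total_pairs, round(estimated_true), round(observed_fraction, 2))
```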
-
In light of some issues discussed in this thread, we’ve updated the scoring algorithm for the competition. See the Feb 19 entry on the Announcements page for more info. Feel free to email me with any questions related to the change. Thanks to those of you that helped us find the issues.
-
Eric: The scoring algorithm is instructive, but raises a question.
It appears that submitted values must be integers:
t = Integer.parseInt(tokens[2]) - 1;
But the Java documentation says:
“Lines with non-floating point scores are ignored”
and the Challenge PDF says:
"The predicted value may be binary indicating presence or absence
of a signal, or a real number indicating the strength of the signal."
Am I mis-reading the Java code?
-
Oops! Didn’t read far enough …
double prob = 0.0+Double.parseDouble(tokens[tokens.length-1]);
All is OK!
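For those not reading Java, a rough Python rendering of the two parsing steps quoted in this exchange: the third token is read as an integer and shifted by one, and the last token is read as a floating-point score. The comma delimiter, the meaning of the first two columns, and the name `parse_prediction` are assumptions; the real scorer may handle bad lines differently.

```python
def parse_prediction(line):
    """Return (year_index, score) for a Challenge 2 style line, or None if it is unusable."""
    tokens = line.strip().split(",")
    try:
        year_index = int(tokens[2]) - 1  # third token as an integer, shifted to 0-based
        score = float(tokens[-1])        # last token is the predicted value
    except (IndexError, ValueError):
        # Lines whose year or score cannot be parsed are skipped.
        return None
    return year_index, score

# Hypothetical lines, assumed to be drug id, condition id, year, score.
good = parse_prediction("12,300,10,0.75")
bad_score = parse_prediction("12,300,10,high")
print(good, bad_score)
```

Binary values such as `1` and real numbers such as `0.75` both parse as floats, so either form of prediction is accepted.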
-
Hi Mike,
The line with tokens[2] is reading the timestamp/year, and it should be an integer.
And the documentation there is actually a double negative :) so it accepts floating point scores.. to be more precise, the score reading happens at line 245 and 291.
Hope that helps,
(and hope that’s correct, Eric!)
Harris.
-
I have two general questions for the score system. First, will the demographic data affect the correlation between a certain drug D and a certain condition C? For example, C and D are correlated for male but not for female. If it is true, will your score system treat C and D correlated or not? Second, it seems the correlation between drug and condition could be strong, moderate, or small in your simulation procedure. If it is true, then your score system treats them the same. It seems the way you evaluate is discouraging the method which can rank the strong correlation higher. It is just a thought unrelated to the competition itself because everyone follows the same rule. But I think it is an important issue for the real problem itself.
-
Hawkeye: In our simulation, while demographics like gender and age do impact the probability of a drug or condition occurring in a particular patient, they do not change whether a condition is related to a drug. For example, condition C might never appear in men, but it is still correlated with drug D in women, and we consider it a true relationship. This does bring up an interesting hypothetical where a drug would increase the risk of a condition in one group and decrease it in another, cancelling each other out when examining everybody. I don't know whether there are real examples of this, but it isn't modeled by the current simulator.
Yes, the current scoring system treats all associations the same, regardless of their strength. However, an association might be "strong", but the drug or condition very rare, and thus still difficult to detect. We certainly intend to look at how different methods do at detecting signals of different strengths, but weighting certain relationships differently has other problems (not that I'm claiming that our scoring system choice is perfect by any means). For example, if your score gives more weight to detecting stronger relationships, that will tend to clump method scores together and reduce the incentive to identify weak relationships. Another possibility is to use a rank correlation metric (like Spearman's rho) where the true relationships are ordered by the strength of the correlation. However, the overwhelming proportion of the relationships are "tied" at zero, and would dominate the score, so you would need some way to control for this. It is an interesting thing to think about though, particularly if you have preferences on identifying certain types of relationships.