Workunit error - check skipped


Message boards : SZTAKI Desktop Grid : Workunit error - check skipped

Chris Kojiro
Joined: Apr 21 06
Posts: 9
Credit: 2,204,258
RAC: 474
Message 4549 - Posted 31 Oct 2006 19:08:23 UTC

    Last modified: 31 Oct 2006 19:15:20 UTC

    Workunit error - check skipped

    What does it mean?

    See:
    http://szdg.lpds.sztaki.hu/szdg/result.php?resultid=90609

    There was a post in the "old" message board with a similar title, but I couldn't deduce a coherent answer from the posted responses. Maybe I'm just overthinking. But after 6.5 days of processing you have to question the results, or the lack of them. And this is not the first one of these, just the first time I'm posting.

    I think this is very interesting science, but what's going on with the application?

    larry1186
    Joined: Sep 25 06
    Posts: 37
    Credit: 18,502
    RAC: 0
    Message 4550 - Posted 31 Oct 2006 19:25:26 UTC

      Check the WU.

      errors - Too many total results

      ____________
      Don't get distracted by shiny objects.

      robert.mouris
      Joined: Nov 3 05
      Posts: 129
      Credit: 4,124,189
      RAC: 0
      Message 4551 - Posted 31 Oct 2006 19:33:31 UTC - in response to Message 4550.

        Last modified: 31 Oct 2006 19:34:56 UTC

        errors - Too many total results

        I think that the choice of "max # of error/total/success results 5, 8, 5" is too low, or that the error count includes results that aren't really errors. In my opinion a WU shouldn't be considered erroneous if the cruncher doesn't reply within the deadline or aborts the WU deliberately (OK if the WU is really stuck, but not if the user is bored or impatient or got too many WUs...). But this is probably a BOINC feature which SZTAKI can't change. In that case the accepted numbers (5, 8, 5) should be higher.
        ____________

        larry1186
        Joined: Sep 25 06
        Posts: 37
        Credit: 18,502
        RAC: 0
        Message 4553 - Posted 31 Oct 2006 20:30:52 UTC - in response to Message 4551.

          errors - Too many total results

          I think that the choice of "max # of error/total/success results 5, 8, 5" is too low, or that the error count includes results that aren't really errors. In my opinion a WU shouldn't be considered erroneous if the cruncher doesn't reply within the deadline or aborts the WU deliberately (OK if the WU is really stuck, but not if the user is bored or impatient or got too many WUs...). But this is probably a BOINC feature which SZTAKI can't change. In that case the accepted numbers (5, 8, 5) should be higher.


          I agree. With the unpredictable behavior of this project and many WUs being aborted, going past deadline, or just erroring out, the max number of results should be higher to accommodate this. For example, Leiden Classical has:
          "For this project we have a minimal quorum of 2, an initial replication of 2 and a maximum number of errors of 16."


          It would be nice to separate out the "No Reply" and "aborted by user" errors from the true application errors. But if nothing else, No Reply should not count towards the result or error count, because you get neither a result nor an error. Other than changing the number of errors/results I don't think there's much that can be done.

          @Chris Kojiro: Bummer it happened on a WU that took 150+ hrs to crunch.
          ____________
          Don't get distracted by shiny objects.

          robert.mouris
          Joined: Nov 3 05
          Posts: 129
          Credit: 4,124,189
          RAC: 0
          Message 4556 - Posted 1 Nov 2006 8:48:45 UTC - in response to Message 4553.

            Last modified: 1 Nov 2006 8:53:03 UTC

            errors - Too many total results

            What about this one? According to our last posts, this WU should be in error with 6 invalid results. Computer 19780 has "Maximum CPU time exceeded" and is the only one to have produced an application error; all the other WUs have an error outside their scope. So the WU should have been deleted on 25 October. But on 29 October I was given the WU to crunch.

            What is the reason for this? I see 3 possibilities:

            • Despite what we understood yesterday, not all errors are errors.
            • The error counting starts only when a seemingly valid WU is checked by the validator; all the fake errors skip the validation process and do not start the error counting, but are taken into consideration once the error counting really starts.
            • Reaching the maximum error number doesn't delete the WUs already created but not yet sent out. For WUs reported after the deadline this deletion works.


            This is definitely a BOINC problem and SZTAKI can only bypass it by increasing the maximum error number.

            Anyhow, I suspended this WU as I think that there is no chance to report a result that will be validated. I won't abort it yet, and hope that someone can convince me that I am wrong. I hate aborting WUs.
            ____________

            larry1186
            Joined: Sep 25 06
            Posts: 37
            Credit: 18,502
            RAC: 0
            Message 4557 - Posted 1 Nov 2006 16:23:18 UTC - in response to Message 4556.

              Last modified: 1 Nov 2006 16:26:50 UTC

              Looking at three interesting jobs:


              • WU #2324 from Chris Kojiro that has too many results.
              • WU #1870 from robert.mouris that does not have too many results or too many errors.
              • WU #1456 from larry1186 that has too many results AND too many errors.



              #1870 has 5 errors (invalid state) and one No Reply, therefore "No Reply" WUs do not count toward errors.

              #1870 has 7 sent (1 In Progress) and 2 unsent, but not too many results, therefore "In Progress" WUs do NOT count toward results.

              #2324 has 7 sent WUs and 2 unsent, therefore "Generated but not yet sent" WUs count as results. This should not be so.

              #1456 has 7 errors (4 aborted, 3 app error), therefore aborted WUs count as errors.

              #1456 has 8 sent WUs and 1 unsent. This agrees with #2324: "Generated but not yet sent" WUs count as results.

              I believe the problem is that the server treats "Unsent" WUs as results, while "In Progress" WUs are not counted as results. A WU should only count as a result when it reaches the "Over" status (even if No Reply is the reason) or is actually sent out. Unsent WUs should not be counted as results. Is this something SZTAKI can set? Is there similar behaviour with other projects?

              [edit]robert.mouris, my guess is that as soon as your WU goes past deadline, or you abort, or it finishes just fine, the job will have too many results and not be granted credit.[/edit]
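
The counting rules inferred in this post can be written down as a small tally. This is only a sketch of the behaviour guessed from the three workunits, not the BOINC server's actual logic. The state labels (`error`, `aborted`, `no_reply`, `in_progress`, `unsent`) and the function name `tally` are made up for illustration, and it assumes a limit is breached only when a count strictly exceeds it.

```python
from collections import Counter

# Illustrative state labels, not BOINC identifiers.
COUNTS_AS_ERROR = {"error", "aborted"}    # aborted counted as errors in WU #1456
NOT_COUNTED_AS_RESULT = {"in_progress"}   # inferred from WU #1870

MAX_ERRORS, MAX_TOTAL = 5, 8              # project limits quoted in this thread


def tally(states):
    """Return (errors, total) under the rules inferred in this post."""
    counts = Counter(states)
    errors = sum(n for s, n in counts.items() if s in COUNTS_AS_ERROR)
    total = sum(n for s, n in counts.items() if s not in NOT_COUNTED_AS_RESULT)
    return errors, total


# WU #1870: 5 errors, 1 No Reply, 1 In Progress, 2 unsent.
wu_1870 = ["error"] * 5 + ["no_reply", "in_progress"] + ["unsent"] * 2
errors, total = tally(wu_1870)
print(errors, total)                            # 5 8
print(errors > MAX_ERRORS, total > MAX_TOTAL)   # False False
```

Adding `"unsent"` to `NOT_COUNTED_AS_RESULT` would model the change asked for here, where unsent WUs no longer count as results.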
              ____________
              Don't get distracted by shiny objects.

              Chris Kojiro
              Joined: Apr 21 06
              Posts: 9
              Credit: 2,204,258
              RAC: 474
              Message 4558 - Posted 1 Nov 2006 17:49:31 UTC

                Thanks for the replies, they explain the "workunit error".

                I agree with the suggestion to increase the max number of results for this project, perhaps from 8 to some higher number, e.g. 12 or 16.
                Also, increase the max number of error results from 5 to something like 10 or 14.

                XJR-Maniac
                Joined: Feb 9 06
                Posts: 27
                Credit: 225,973
                RAC: 0
                Message 4565 - Posted 2 Nov 2006 20:53:37 UTC

                  Last modified: 2 Nov 2006 20:54:05 UTC

                  Nice to see that nothing has changed here. OK, give it another try next year ;-(((

                  Resultid=31706

                  Resultid=31626

                  ____________

                  Nightbird
                  Forum moderator
                  Joined: Jul 12 05
                  Posts: 920
                  Credit: 114,924
                  RAC: 0
                  Message 4584 - Posted 5 Nov 2006 15:16:12 UTC - in response to Message 4558.

                    Last modified: 5 Nov 2006 15:27:20 UTC

                    Thanks for the replies, they explain the "workunit error".

                    I agree with the suggestion to increase the max number of results for this project, perhaps from 8 to some higher number, e.g. 12 or 16.
                    Also, increase the max number of error results from 5 to something like 10 or 14.

                    Temporarily increasing the max number of error results and the max number of results could be a good solution if we don't want to see cases like this too often:

                    max # of error/total/success results : 5, 8, 5
                    wuid=10612

                    errors : Too many error results

                    Client error : 6
                    Success : 1 -> Validate state : Workunit error - check skipped -> Granted credit = 0
                    In Progress : 1 (no chance now to get credits)
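
For reference, the check behind these messages can be sketched as below. It assumes a workunit is flagged when a count strictly exceeds its limit, which matches this example (6 client errors against a limit of 5) but is an assumption rather than something confirmed by the project. `check_limits` and its parameters are hypothetical names, and the success check is simplified to a plain count.

```python
def check_limits(errors, total, successes, limits=(5, 8, 5)):
    """Return the error messages a workunit would get under the assumed rule."""
    max_errors, max_total, max_successes = limits
    flags = []
    if errors > max_errors:
        flags.append("Too many error results")
    if total > max_total:
        flags.append("Too many total results")
    if successes > max_successes:
        flags.append("Too many success results")
    return flags


# wuid=10612: 6 client errors, 1 success, 1 in progress (8 results in all).
print(check_limits(errors=6, total=8, successes=1))
# ['Too many error results']

# Same workunit with the error limit raised to 10 and the total limit to 12,
# two of the values suggested earlier in the thread.
print(check_limits(errors=6, total=8, successes=1, limits=(10, 12, 5)))
# []
```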

                    ____________

                    Bernhard Frey
                    Joined: Sep 6 05
                    Posts: 8
                    Credit: 91,198
                    RAC: 0
                    Message 4585 - Posted 5 Nov 2006 16:43:28 UTC

                      Let's hope Adam will manually grant credit for those WUs with Too many total results. And let's hope he does it quickly, because the automatic delete feature will purge those WUs (like all other finished WUs) from the database in a few days.
                      ____________

                      Nightbird
                      Forum moderator
                      Joined: Jul 12 05
                      Posts: 920
                      Credit: 114,924
                      RAC: 0
                      Message 4588 - Posted 5 Nov 2006 17:41:33 UTC - in response to Message 4585.

                        Last modified: 5 Nov 2006 17:43:19 UTC

                        Let's hope Adam will manually grant credit for those WUs with Too many total results. And let's hope he does it quickly, because the automatic delete feature will purge those WUs (like all other finished WUs) from the database in a few days.

                        I would add Too many error results.
                        ____________

                        Chris Kojiro
                        Joined: Apr 21 06
                        Posts: 9
                        Credit: 2,204,258
                        RAC: 474
                        Message 4636 - Posted 10 Nov 2006 15:16:17 UTC

                          Another Workunit with Too many total results. This one affects not only me, but also another user.

                          There is another newer thread dealing with the same problem.

                          There will always be some cases where "Too many total results" or "Too many error results" will need to be applied to an unusual Workunit. But there seem to be too many of these Workunits in this project.

                          Something needs to be done to ameliorate the denial of credits to conscientious users, through no fault of their own. I echo Nightbird's request to Adam that the limits be increased.

                          Chris

                          Gary
                          Joined: Nov 9 06
                          Posts: 1
                          Credit: 1,555
                          RAC: 0
                          Message 4658 - Posted 13 Nov 2006 17:47:06 UTC

                            I just started on this project and after 35 hours on this WU got no credit.

                            PANTS.
                            ____________

                            Chris Kojiro
                            Joined: Apr 21 06
                            Posts: 9
                            Credit: 2,204,258
                            RAC: 474
                            Message 5522 - Posted 22 Jan 2007 17:50:23 UTC

                              Last modified: 22 Jan 2007 17:59:16 UTC

                              Howdy all,

                              Well, now a workunit with Too many success results. If you examine the individual results with an outcome of Success you'll find all except one (mine) have that maddening notation of "APP: Output is empty". This has been noted (e.g.) quite frequently recently as a big annoyance involving a problem with checkpoints in conjunction with reinitializing BOINC (see also). Besides the fact that this involves one of my results, I'm bringing this up to ask why the limits were reverted to the old workunit limits of 5, 8, 5, i.e.

                              max # of error/total/success results 5, 8, 5

                              These limits had been bumped up to 10, 20, 10 a few months back (e.g.). A recent post noted that there is some logic to a smaller limit, but as noted by the same poster there is also some basis for the increase in limits.
                              At the very least there has to be some mechanism to provide credit for those who persevere and obtain "non-empty" results; otherwise, everyone will just abort any workunit that already has a significant number of "empty" results.

                              ........

                              Another question: why are the "APP: Output is empty" results not considered errors? The BOINC program may consider them a Success, but this project knows better.
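
One way to picture the question above: if empty outputs were reclassified as errors before counting, they would no longer use up the success limit. The sketch below is purely illustrative. `reclassify`, the tuple layout, and the sample data are hypothetical and are not part of BOINC or of this project's validator.

```python
def reclassify(outcome, output_text):
    """Treat a 'success' with empty output as an error (hypothetical rule)."""
    if outcome == "success" and not output_text.strip():
        return "error"
    return outcome


# Made-up sample: six reported successes, five of them with empty output.
results = [("success", "")] * 5 + [("success", "some output data")]

raw_successes = sum(1 for outcome, _ in results if outcome == "success")
real_successes = sum(
    1 for outcome, text in results if reclassify(outcome, text) == "success"
)
print(raw_successes, real_successes)   # 6 1
```

With a success limit of 5, the raw count of 6 would exceed the limit while the reclassified count of 1 would not, although the five empty results would then count against the error limit instead.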

                              Happy computing, Chris

                              Chris Kojiro
                              Joined: Apr 21 06
                              Posts: 9
                              Credit: 2,204,258
                              RAC: 474
                              Message 5524 - Posted 22 Jan 2007 19:13:42 UTC

                                Addendum:
                                With regard to the "APP: Output is empty" see this, and also this, and the posts that they note.

                                Thanks, Czesc, Chris

                                Chris Kojiro
                                Joined: Apr 21 06
                                Posts: 9
                                Credit: 2,204,258
                                RAC: 474
                                Message 5760 - Posted 27 Feb 2007 14:39:08 UTC

                                  The plot thickens. Another workunit with Too many success results, but there are actually three results without the notation "APP: Output is empty" that look like they could be good results. It appears, though, that the potentially "good" results aren't even checked, because the validator accepts the "empty" results as "successful". Argh (or some similar spelling indicating frustration).

                                  The hopefully \"good\" results are:
                                  1)
                                  2)
                                  3)

                                  Although, as I copy the URLs for these results, I note that 3) may be suspect, i.e. an instance of a reset on a single line workunit that did not produce an "empty output" notice. Well, I'll leave it up to the "powers that be" to analyze this in detail and provide a final judgment. Nevertheless, I remain frustrated.

                                  May your computing be fruitful, Chris

                                  Fritz
                                  Joined: Apr 3 06
                                  Posts: 8
                                  Credit: 38,256
                                  RAC: 0
                                  Message 5792 - Posted 2 Mar 2007 0:56:44 UTC - in response to Message 4551.

                                    In my opinion a WU shouldn't be considered erroneous if the cruncher doesn't reply within the deadline


                                    This one is compounded by another problem I'm seeing across various projects. The server sends out a WU, but the client never receives it. From what others have posted, I think this happens when the connection is having problems and the BOINC client times out the connection before the download takes place, but after the server starts the download process.

                                    What you then get is a ghost "In Progress" WU that is listed in your results, but was not received by your computer.

                                    This is a BOINC problem, but it just adds another error to the list of possible screwups for this project.
                                    ____________

                                    Chris Kojiro
                                    Joined: Apr 21 06
                                    Posts: 9
                                    Credit: 2,204,258
                                    RAC: 474
                                    Message 5915 - Posted 15 Mar 2007 17:08:50 UTC

                                      "APP: Output is empty" results in another peculiar response from this project. Now credit is being given even though the "APP: Output is empty" result is reported. At first glance this seems to be a good response, in that it rewards users' attempts to run the application even when the result is "empty" through no fault of theirs. But then the users who do produce a full (potentially valid) result are shortchanged in the amount of credit they are given. And, worse, it encourages people to abort a workunit when they see most of the previous results were "APP: Output is empty", because they know they will be given some credit as long as they have run for a short period of time. There is no reason to let it run to completion, since they will get the same amount of credit if they abort the workunit early and get the "APP: Output is empty" result.

                                      Of course I noticed this because of its effect on my results: I completed the calculations, but was given the same amount of credit as the users with the "APP: Output is empty" result. These are some examples:

                                      1.) workunit 11170
                                      2.) workunit 10168
                                      3.) workunit 15039

                                      While I'm happy to get some credit, it just doesn't seem fair. Plus, as noted above, it encourages users to abort any workunit where the previous results have returned the "APP: Output is empty" message. Ultimately, a fix has to be found for the checkpoint problem that seems to lead to the "APP: Output is empty" message. Thanks for allowing me to vent.

                                      Happy trails to you, Chris

                                      robert.mouris
                                      Joined: Nov 3 05
                                      Posts: 129
                                      Credit: 4,124,189
                                      RAC: 0
                                      Message 5916 - Posted 15 Mar 2007 17:45:57 UTC - in response to Message 5915.

                                        Last modified: 15 Mar 2007 17:53:04 UTC

                                        And, worse, it encourages people to abort a workunit when they see most of the previous results were "APP: Output is empty", because they know they will be given some credit:
                                        ...
                                        2.) workunit 10168

                                        I knew that there is a problem, but wasn't aware that it's that grotesque. One result was processed for 429,584 seconds and another one for 2.77 seconds. Both get 115.73 credits. If people abort such WUs deliberately, the crunchers who work with scientific progress in mind will stop participating one after another, and the remaining crunchers will just send back useless results that give them credits. This is a horrible prospect. But in the long run it can't work, as it relies on people crunching the WUs for a long time. If everyone stops after a few seconds, the validator will award them 0.0001 credits, rounded to 0.00 credits, and then some of them will complain "Why don't I get my credits?". So there is a safeguard against this ultimate scenario.
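
To put the two figures above side by side, here is a quick calculation of credit per CPU-hour. It only uses the numbers quoted in this post; the function name `credits_per_hour` is just for illustration and says nothing about how the project actually computes credit.

```python
def credits_per_hour(credit, cpu_seconds):
    """Credit earned per hour of CPU time."""
    return credit / (cpu_seconds / 3600.0)


CREDIT = 115.73                                  # granted to both results
long_rate = credits_per_hour(CREDIT, 429_584)    # the full run
short_rate = credits_per_hour(CREDIT, 2.77)      # the run that stopped almost at once

print(f"{long_rate:.2f} credits/hour")     # 0.97 credits/hour
print(f"{short_rate:,.0f} credits/hour")   # 150,407 credits/hour
```

The full run earns under one credit per hour, while the 2.77 second result earns the same 115.73 credits for practically no work.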

                                        One small clarification: aborting a WU gives no credit. To get a seemingly valid but useless result, one must either quit BOINC, restart the computer, or switch to a different project with "Leave applications in memory while suspended?" set to NO.

                                        We urgently need a solution to the checkpoint problem, so that I can resume my participation.
                                        ____________


                                        Copyright © 2017 SZTAKI Desktop Grid