Graphics don't seem right

agentq · Post by **agentq** » Fri Dec 07, 2007 8:43 am

When you press Y, the debug console should appear on top of the game screen. When I tried this a couple of days ago, the top part of the game screen was corrupted whenever text was printed to the console (add some calls to consolePrintf to reproduce it). It looked like the two layers were overwriting each other's VRAM because of the VRAM bank allocation. I would look at which block you assigned to the 15-bit layer for your scaled screen, and change the block assigned to the text layer so they don't overwrite each other.

The text is drawn using one of the DS's tile layers. It uses palette colour 255. What colour that appears as depends on the game. There's no other way to do this since text layers always use paletted images, and the palette is shared with the rest of the game. When this annoys me I comment a few lines out in OSystem so that the game cannot use colour 255, and then the text always appears white, at the expense of one incorrect colour in the game.
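A minimal sketch of the "reserve colour 255" idea, written as a compile-time switch. Everything here is hypothetical: `RESERVE_CONSOLE_COLOUR`, `set_game_palette` and the `hw_palette` array are illustrative stand-ins, not the actual OSystem code or the DS palette registers.

```c
#include <stdint.h>

/* Hypothetical switch: set to 0 to let the game use colour 255 again. */
#define RESERVE_CONSOLE_COLOUR 1
#define CONSOLE_COLOUR_INDEX   255
#define WHITE_555              0x7FFF  /* 5 bits each of red, green, blue */

/* Stand-in for the hardware palette shared by the game and the text layer. */
static uint16_t hw_palette[256];

/* Copy `num` game colours into the palette starting at index `start`.
   When the switch is on, index 255 is forced to white so console text
   stays readable, at the cost of one wrong colour in the game. */
void set_game_palette(const uint16_t *colours, int start, int num)
{
    for (int i = 0; i < num; i++) {
        int index = start + i;
        if (index < 0 || index > 255)
            continue;
#if RESERVE_CONSOLE_COLOUR
        if (index == CONSOLE_COLOUR_INDEX) {
            hw_palette[index] = WHITE_555;
            continue;
        }
#endif
        hw_palette[index] = colours[i];
    }
}
```

With the switch set to 0, the function degenerates to a plain palette copy, so the game keeps all 256 colours and the console text takes whatever colour 255 happens to be.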

Also, out of interest, how much RAM does the scaler use? If there are games which will run out of memory with the scaler turned on, we should disable the scaler in those games to stop all the support emails I will get otherwise! The good thing about this is that the later games are more likely to be talkie ones, which don't really need a scaler since there is very little text.

agentq · Post by **agentq** » Fri Dec 07, 2007 9:08 am

Oh, and 5 ms is pretty amazing, btw. It does a lot of work in a very short time.

Tramboi · Post by **Tramboi** » Fri Dec 07, 2007 9:48 am

agentq wrote:When you press Y, the debug console should appear on top of the game screen. When I tried this a couple of days ago, the top part of the game screen was corrupted whenever text was printed to the console (add some calls to consolePrintf to reproduce it). It looked like the two layers were overwriting each other's VRAM because of the VRAM bank allocation. I would look at which block you assigned to the 15-bit layer for your scaled screen, and change the block assigned to the text layer so they don't overwrite each other.

The text is drawn using one of the DS's tile layers. It uses palette colour 255. What colour that appears as depends on the game. There's no other way to do this since text layers always use paletted images, and the palette is shared with the rest of the game. When this annoys me I comment a few lines out in OSystem so that the game cannot use colour 255, and then the text always appears white, at the expense of one incorrect colour in the game.

Ok I'll check this.
Could you please put a #define in OSystem so I can easily get the white text too?

agentq wrote: Also, out of interest, how much RAM does the scaler use? If there are games which will run out of memory with the scaler turned on, we should disable the scaler in those games to stop all the support emails I will get otherwise! The good thing about this is that the later games are more likely to be talkie ones, which don't really need a scaler since there is very little text.

Well, the good news is that it doesn't use any RAM except VRAM (and a bit of stack space).

It just blits directly from VRAM to VRAM (which makes me think my 5 ms estimate must be wrong, because I measured the scaler from main RAM to main RAM; I have to check the memory latencies ^^).
The secondary screen is directly the 320*200*PAL8 buffer and the main screen is the scaled 256*256*1555 buffer.
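To make the buffer formats concrete: a 320-pixel-wide PAL8 source and a 256-pixel-wide 1555 destination imply a 5:4 horizontal ratio, so each group of 5 palette indices becomes 4 output pixels. The sketch below is only one plausible way to do that in C (area-weighted blending of neighbouring pixels); the function names and the blend weights are assumptions, not the scaler discussed in this thread. It also assumes the top bit of a 1555 pixel must be set for the pixel to be displayed.

```c
#include <stdint.h>
#include <stddef.h>

/* Blend two 1555 colours per 5-bit channel with weights wa:wb (wa + wb == 5).
   Bit 15 is set in the result (assumed to mark the pixel as visible). */
static uint16_t blend1555(uint16_t a, uint16_t b, unsigned wa, unsigned wb)
{
    unsigned r = ((a & 0x1F) * wa + (b & 0x1F) * wb) / 5;
    unsigned g = (((a >> 5) & 0x1F) * wa + ((b >> 5) & 0x1F) * wb) / 5;
    unsigned bl = (((a >> 10) & 0x1F) * wa + ((b >> 10) & 0x1F) * wb) / 5;
    return (uint16_t)(0x8000u | (bl << 10) | (g << 5) | r);
}

/* Scale one scanline: every 5 source indices produce 4 output pixels.
   src_width is expected to be a multiple of 5 (320 here); a trailing
   partial group is ignored. */
void scale_line_5to4(const uint8_t *src, uint16_t *dst,
                     const uint16_t *pal, size_t src_width)
{
    for (size_t x = 0; x + 5 <= src_width; x += 5) {
        uint16_t p0 = pal[src[x]];
        uint16_t p1 = pal[src[x + 1]];
        uint16_t p2 = pal[src[x + 2]];
        uint16_t p3 = pal[src[x + 3]];
        uint16_t p4 = pal[src[x + 4]];
        /* Each output pixel covers 1.25 source pixels. */
        dst[0] = blend1555(p0, p1, 4, 1);
        dst[1] = blend1555(p1, p2, 3, 2);
        dst[2] = blend1555(p2, p3, 2, 3);
        dst[3] = blend1555(p3, p4, 1, 4);
        dst += 4;
    }
}
```

Calling `scale_line_5to4(src, dst, pal, 320)` fills 256 destination pixels from one 320-pixel source line.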

Whether there is still enough VRAM for the debug text, I'll have to check.

robinwatts · Post by **robinwatts** » Fri Dec 07, 2007 3:16 pm

Nice job finding the mistake in there - and nice optimisation in the 8bit case too.

I note that the DS scaler takes a whole screen and scales it all at once. Is there mileage in only rescaling the dirty rectangle? (Obviously, if you scroll the screen, that dirty rectangle would have to be the full screen...)

Robin

Tramboi · Post by **Tramboi** » Fri Dec 07, 2007 4:19 pm

robinwatts wrote: I note that the DS scaler takes a whole screen and scales it all at once. Is there mileage in only rescaling the dirty rectangle? (Obviously, if you scroll the screen, that dirty rectangle would have to be the full screen...)

I wondered about this but I think I prefer us optimizing the worst case (ie scrolling) so that we have a constant overhead... i.e. no visible slowdown when the load increases

Btw, are you proficient with the DS memory timings, guys?
I'm not

So I posted this to get some information that might help improve the scaler's memory bandwidth and latency usage (I think the computations themselves are quite fast by now):

http://forum.gbadev.org/viewtopic.php?t=14635

No useful answer yet but who knows.
Experimenting is quite tedious with my linker so gathering precise info would be nice.

agentq · Post by **agentq** » Sat Dec 08, 2007 10:35 pm

I would imagine that using the dirty rectangle would have massive gains since most of the scumm games modify very little most of the time. Of course, scrolling is important, but some of the scumm games scroll only very occasionally, since it was slow on the original PC games as well.
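As a rough illustration of what rescaling only the dirty rectangle would involve horizontally, here is a sketch that widens a dirty column range to whole groups of 5 source pixels and maps it to the matching destination columns. It assumes the 5:4 ratio implied by the 320-wide source and 256-wide destination, and non-negative coordinates; `ScaleSpan` and `dirty_span_5to4` are hypothetical names.

```c
/* Half-open column ranges [x0, x1) in source and destination space. */
typedef struct {
    int src_x0, src_x1;  /* source columns to re-read */
    int dst_x0, dst_x1;  /* destination columns to rewrite */
} ScaleSpan;

/* Widen the dirty range [x0, x1) to whole 5-pixel groups, then map it
   to destination columns (4 output pixels per group). */
ScaleSpan dirty_span_5to4(int x0, int x1)
{
    ScaleSpan s;
    s.src_x0 = (x0 / 5) * 5;        /* round down to a group boundary */
    s.src_x1 = ((x1 + 4) / 5) * 5;  /* round up to a group boundary */
    s.dst_x0 = s.src_x0 / 5 * 4;
    s.dst_x1 = s.src_x1 / 5 * 4;
    return s;
}
```

The dirty rows would then be rescaled over that span only; a full-screen scroll simply yields the span 0..320, which is the current behaviour.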

And as for the memory timings, I think that the VRAM is on a 16-bit bus from the CPU, so writing to it with 32-bit writes shouldn't make any difference.

robinwatts · Post by **robinwatts** » Sun Dec 09, 2007 1:43 pm

agentq wrote:I would imagine that using the dirty rectangle would have massive gains since most of the scumm games modify very little most of the time. Of course, scrolling is important, but some of the scumm games scroll only very occasionally, since it was slow on the original PC games as well.

If all the games are now running full speed, then maybe it's not an issue. Otherwise, it seems a sensible thing to look at; I'd rather play a game that hits the desired frame rate 90% of the time than one which always misses it.

I don't have an NDS to try this stuff on. Do games run full speed as they are now? (Can you reset the clock speed on an NDS as you can on devices like the PSP? As such, does it run full speed at the lowest clock speed possible?)

agentq wrote:And as for the memory timings, I think that the VRAM is on a 16-bit bus from the CPU, so writing to it with 32-bit writes shouldn't make any difference.

As I understand it, when accessing memory, there are 2 different types of cycles; N and S. N cycles (Non Sequential cycles) are where you access a new address. S (Sequential cycles) are where you access the *next* address from the last one.

So on a 16-bit bus, every time you do an STRH, that's potentially at a different address - hence an N cycle each time. So 4 STRHs in a row cost 4 N cycles - regardless of the addresses actually accessed.

If you do an STR though, that's an N cycle + an S cycle. So 2 STRs costs 2N+2S.

If you do an STM of 2 regs, that's an N cycle + 3 S cycles.

So, dependent on the relative timings of N and S (and S <= N in all cases), it can be worth doing more complex manipulations of the data and then writing larger lumps back.

Of course, there is the additional complexity of the write buffer to think about (which can combine writes to successive addresses to avoid using N's in some cases, I believe).
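The counts above can be turned into a tiny cost model for writing four 16-bit pixels. This is only arithmetic on the figures quoted in this post: the N and S values you pass in are placeholders (the real VRAM timings are exactly what is unknown here), and the write buffer is ignored.

```c
#include <stdio.h>

/* Cycle cost of storing four 16-bit pixels over a 16-bit bus,
   given the cost of one N cycle (n) and one S cycle (s). */
static unsigned cost_4_strh(unsigned n, unsigned s)
{
    (void)s;              /* every STRH is a fresh N cycle */
    return 4 * n;
}

static unsigned cost_2_str(unsigned n, unsigned s)
{
    return 2 * n + 2 * s; /* each STR is one N plus one S */
}

static unsigned cost_stm_2regs(unsigned n, unsigned s)
{
    return n + 3 * s;     /* one N, then three sequential halfwords */
}

/* Print the three options side by side for hypothetical timings. */
void print_store_costs(unsigned n, unsigned s)
{
    printf("4 x STRH: %u\n", cost_4_strh(n, s));
    printf("2 x STR : %u\n", cost_2_str(n, s));
    printf("STM x2  : %u\n", cost_stm_2regs(n, s));
}
```

Since S <= N, the STM form never costs more than the others in this model; how much it actually gains depends on the real N and S values, which is why measuring on hardware is still needed.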

In short, I think you need to test to be sure, but it's definitely worth trying.

Robin

Tramboi · Post by **Tramboi** » Sun Dec 09, 2007 6:57 pm

Hi guys,

I did real in-situ measurements and found this:
The 5 STRH version was taking 18 ms (very far from the fake benchmark's 5 ms ^^)
The write-gathering STMIA one Robin sent me takes 16 ms (after fixing a copy/paste bug)
Out of curiosity, I tried a version that doesn't write anything: it takes 14 ms, so I suspect the stores are already good enough.
I'll investigate loading now...

Cheers!
Bertrand

Tramboi · Post by **Tramboi** » Sun Dec 09, 2007 7:06 pm

A version that loads only one byte instead of 5 and replicates it in the 5 pixels takes 11.5 ms instead of 16 ms.
A version that loads the 5 pixels but doesn't look them up in the palette runs at the same speed as the version that does, so accessing the stack seems very inexpensive.

I basically unrolled the loop twice (no peephole optimizations afterwards) to gather the loads into 16-bit loads, and it went down to 14 ms.
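For readers following along, the idea of gathering byte loads into 16-bit loads looks roughly like this in C. It is a sketch of the technique only, not the committed assembly: `expand_pairs` is a hypothetical name, the source is assumed to be 2-byte aligned and little-endian (as on the DS), and no scaling is done, just the palette expansion.

```c
#include <stdint.h>
#include <stddef.h>

/* Expand 8-bit palette indices to 16-bit colours, fetching two indices
   per 16-bit load. src16 is the index buffer viewed as halfwords;
   count is the number of halfwords (so 2 * count pixels are written). */
void expand_pairs(const uint16_t *src16, uint16_t *dst,
                  const uint16_t *pal, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint16_t pair = src16[i];            /* one load instead of two byte loads */
        dst[2 * i]     = pal[pair & 0xFF];   /* first pixel: low byte */
        dst[2 * i + 1] = pal[pair >> 8];     /* second pixel: high byte */
    }
}
```

The extra mask and shift are cheap register operations, so the trade only pays off when each memory access is expensive, which is the situation measured above.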

robinwatts · Post by **robinwatts** » Sun Dec 09, 2007 8:10 pm

In the latest version I see you've done LDRH and masking, instead of 2 LDRBs. Normally, I'd expect that to be slower as the first LDRB would pull a whole cache line in, and subsequent LDRBs would run very fast.

But if it's running in VRAM, there is a strong probability that this will be uncached, and so you may indeed do better with LDRHs.

We can still do better with instruction scheduling, and can lose one more LDRB too. New version at:

http://www.wss.co.uk/scummvm/blitters_arm.s

I did debate just committing it, but judging by my recent performance with silly typos, probably best not to :)

Robin

Tramboi · Post by **Tramboi** » Sun Dec 09, 2007 8:43 pm

Yes, the VRAM is indeed uncached, hence the gain.

Thanks for polishing this version, I just did the basic stuff to check the proof of concept of gathering reads.
I'll test it as soon as I can use my main computer again.
But I really wonder which optimization to do next.

Maybe I will go back to the C++ version to experiment with bringing VRAM to cache through DMA instead of the bus.

robinwatts · Post by **robinwatts** » Sun Dec 09, 2007 8:47 pm

Tramboi wrote:Yes, the VRAM is indeed uncached, hence the gain.

Thanks for polishing this version, I just did the basic stuff to check the proof of concept of gathering reads.
I'll test it as soon as I can use my main computer again.
But I really wonder which optimization to do next.

Dirty rectangles :)

Tramboi wrote: Maybe I will go back to the C++ version to experiment with bringing VRAM to cache through DMA instead of the bus.

If you held the nonscaled version in normal RAM, and then scaled to the VRAM, the cache would kick in and give a speedup, probably.

Tramboi · Post by **Tramboi** » Mon Dec 10, 2007 1:28 am

robinwatts wrote:We can still do better with instruction scheduling, and can lose one more LDRB too. New version at:

http://www.wss.co.uk/scummvm/blitters_arm.s

I did debate just committing it, but judging by my recent performance with silly typos, probably best not to

Robin

Fixed it and committed it; it went from 14 ms to 13 ms...

Tramboi · Post by **Tramboi** » Mon Dec 10, 2007 1:31 am

robinwatts wrote:If you held the nonscaled version in normal RAM, and then scaled to the VRAM, the cache would kick in and give a speedup, probably.

Unfortunately, RAM is scarce and the data is already unscaled in VRAM (used in the unscaled view), so we'd better max out the performance we get from the VRAM (and a bit of stack).
Too bad the DMA halts the CPU; it would have been nice to double-buffer the line processing in the stack with DMA transfers, PS2-style.

robinwatts · Post by **robinwatts** » Wed Dec 12, 2007 1:40 am

Tramboi wrote:Fixed it and committed it; it went from 14 ms to 13 ms...

Shall we try for 12ms?

New version at:

http://www.wss.co.uk/scummvm/blitters_arm.s

I've unrolled it one more time, which means we can use LDMs instead of LDRHs.

That should trade 8 N16 cycles for 8 S16 cycles.

By my very rough calculations, that should be worth 1 or 2 ms... (but I could be way off)

Robin