SjASMPlus от z00m

**Ped7g** · 19.03.2021, 11:54

I'm not sure what you mean by those proposals.

Current sjasmplus reads source code in binary 8-bit mode, so whatever is inside double quotes in DB except few control codes which need to be escaped (`\0\n`) will be 1:1 assembled into machine code (I should probably do the test doing full 0..255 char string to be sure it works like that, yeah, I will add one).

So as ZX SW author, you have some encoding on mind (CP1251 or DOS-866 or some custom like 131 = "star" and 132-133-134 "group logo"), and you need to put those values (CP1251 azbuka chars or 131..134 bytes) into machine code, usually DB statement.

That's the result side. And the way how you edit those texts, for example Russian strings, is nice with UTF8 because of modern text editors using utf8 by default.

Current sjasmplus has this ENCODING directive, which makes possible auto-conversion from cp1251 to DOS-866 - it does nothing else, by default it does nothing, and if you do `ENCODING DOS` or use `--dos866` CLI option, any 128+ byte value in source code is transformed by hard-coded table converting CP1251 to DOS-866 (or "damaging" anything else, like UTF8).

So in case of Russian string, the task is "simple", do the utf8 -> cp1251 (or DOS-866) of source before the DB is assembled.

(BTW let's make clear one thing - I'm not going to add UTF8 support for symbol names, so anything outside of quotes does not need any extra support by sjasmplus, as anything non-ASCII is bug in source - the reasons are mostly overlapping with what I will write below, plus extra pain of sjasmplus processing source code heavily with custom implementation, not re-using common C++ library, so utf-8 symbols would cause major rewrite of parser - if somebody wants to do that, ok, but not me, the benefit of unreadable symbol names doesn't attract me at all, even if the code change would be simple)

But this "simple" case means the sjasmplus code will have to learn utf-8, and have the conversion table. And if there is Russian table, why not to add also some german, czech, etc...
... and you end up implementing `iconv` - which is by no means simple task, implementing utf-8 support correctly is major pain, I know of one C++ framework avoiding iconv and using custom code, and it took few years to polish that implementation enough to mostly work as expected.

And at that moment I don't see any benefit of putting iconv into sjasmplus, if I can call the iconv externally as I have shown in that example in previous post.

I can see some benefit of some magic directive which would allow me to define custom encoding, ie. that ★ is 130 and ☈☋☑ is 131,132,134, but I don't see any elegant and symple syntax for that, and the implementation again requires adding all the important parts of what iconv does, ie. understanding utf-8 encoding correctly.

So if you want just the "simple" utf-8 to cp1251 or dos-866 conversion, I'm failing to see why to bloat the sjasmplus code, and not call the external `iconv` as intermediate step before assembling. The result is same, but iconv is more robust and could handle all the common encodings, while sjasmplus will be always very limited in what it knows, unless I re-implement whole iconv into it (and go from ~300kB binary to ~5MB assembler).

- - - Updated - - -

So, I added the test to verify that "anything 8bit inside quotes (except the sensitive control codes) works":
https://github.com/z00m128/sjasmplus...t_encoding.asm

The "sensitive control codes" contains three values: 0, 10 and 13 ("\0\n\r" escape sequences within double-quotes).

All the other 0..255 values are assembled 1:1 to machine code.

So this part of sjasmplus works "as intended" and there's no extra bug involved or any issue.

The [utf8 text source] -> [8bit source] conversion is IMO lot easier to handle externally, and I'm slightly against adding this functionality into sjasmplus (you can of course try to change my mind, but I don't see enough arguments at this moment).

I can see the convenience of such addition, but considering the current size of sjasmplus code and its build-dependencies, I find it not worth of adding utf-8 support, especially as the external usage of `iconv` is trivial in case of non-custom encoding, and does cover LOT MORE than just Russian encodings.

Maybe on non-linux OS the benefit of built-in conversion would be even bigger (if they don't have easy tool like `iconv`), but then again, it's lot more easier to install linux and use it for assembling ZX project, than to modify sjasmplus sources, so my general advice is to use modern OS, and not to reinvent the wheel again and again inside sjasmplus just because some other SW is obsolete.

- - - Updated - - -

One more note... I was even going to propose to call `iconv` from the source with SHELLEXEC (to convert some small "strings.asm" and then INCLUDE the converted one), but it turns out that's not so easy. The SHELLEXEC does execute only in the last third pass, so the converted strings are not available in earlier passes. I guess you can still do this in lua-script, calling `iconv` in first pass to generate "strings.8b.asm" (cp1251 encoded) from "string.asm" (utf8 - what you edit in editor), and then INCLUDE the converted file - let me know if you need example how to do this.

But I generally prefer to not use lua scripts in my asm, so I would instead rather create Makefile with the rule to produce that converted file before assembling of main project.

ZXPRESS •	ZXART •
ZXTUNES •	ZX Spectrum Old Demos •
Virtual TR-DOS •	World of Spectrum •

User Tag List

Тема: SjASMPlus от z00m

Опции темы

Отображение

Древовидный режим

Информация о теме

Пользователи, просматривающие эту тему

Похожие темы

SjASMPlus Z80 кросс ассемблер

Исходники TR-DOS для SjASMPlus

Запуск STS из .sna, сгенерированного sjasmplus.

Breakpoints в связке Sjasmplus+UnrealSpeccy

Disturbed COverMAnia ( music disk with z00m music collection)

Ваши права