SimonW Posted November 22, 2022 · Member I am starting this thread since I didn't want to leave all the great comments here unanswered. Please feel free to post feedback, discuss ideas, etc. in this thread.
SimonW Posted November 22, 2022 · Member Author Posted November 22, 2022 (edited) On 11/21/2022 at 1:11 PM, Ed Snible said: @kirispupis asks for automatic detection of duplicates, and die duplicates, which is a major unsolved research problem and depends on the existence of a database of auctions -- a database that doesn't exist. That database may never exist if we don't build it now. There are multiple archives, at least for the major auctions of the past 2 decades. I don't think there will ever be one that covers all numismatic auctions for two reasons: 1) copyrights and 2) it's a huge amount of work to digitize printed catalogs. On 11/21/2022 at 1:11 PM, Ed Snible said: It would be easy now for an individual to download PDF catalogs from dealer sites, and to "print as PDF" or save in a database the coins appearing on VCoins and eBay. Yet as far as I know no one is doing that. (Perhaps some folks are doing that but can't discuss it because of copyright reasons?). The technical work isn't hard -- a script could do it -- but it is boring. Collecting all sales records online is probably not as easy as you think. Creating a crawler for a single website is easy and quickly done, I agree. But if you have to create and maintain individual crawlers for many different websites (keep in mind that they don't store the entire site like Google, etc. does, but need to extract certain data and feed it into a database) the amount of work quickly grows. Especially when they have a lot of code updates and/or use JavaScript. 21 hours ago, Severus Alexander said: As far as I know, all the auctions on biddr stay online. @SimonW, is that correct? Just go to https://www.biddr.com/closed and narrow down by auction house using the dropdown. For now, yes. But please keep in mind that biddr is not an auction archive. It's an auction platform and old records may get removed at some point for performance reasons. 
This is where acsearch and all the other auction archives come into play 🙂 20 hours ago, Ed Snible said: I purchased coins on pecunum.com between 2013-2016. That entire trove of data is gone. Pecunem was only the auction platform, not the auction house. The auction houses that used Pecunem were Numismatik Naumann (formerly Gitbud & Naumann) and Solidus Numismatik. The auctions of both of them can be found on acsearch and other auction archives. 20 hours ago, Ed Snible said: Of course, not every auction is on biddr. Frank Robinson's catalogs go away when the next auction is loaded. If someone sold me a coin and said it was ex-Frank Robinson Auction 80, it would be hard to verify even with the printed black and white catalogs. Frank's auctions have been available on biddr for the past 3 years. And for the same time their data has been collected for acsearch, although not (yet) visible to the public. We are collecting the auction data of more auction houses than you may be aware of. This data is either not publicly available because we haven't asked the copyright holder(s) for permission yet, or because we weren't granted permission when we did ask them in the past. We still collect it in case we get permission at any time in the future. 20 hours ago, Severus Alexander said: The acsearch image search isn't a machine learning or neural network type thing (which to my mind is what's really needed for this purpose), it's hard coded based on matching outlines. That's correct. The search basically works like a text search, but instead of text it uses certain image features to search an index of image features. Since the image search has been mentioned quite a few times, please note that everyone with a premium account has 5 free image searches per day included. 19 hours ago, kirispupis said: ...because their "normal" search is too basic. That's the major reason I haven't become a paying member as I just don't see the current functionality as valuable... 
Quite the contrary, the search allows very complex queries if you use it correctly. It's often not the search that is limited but the understanding of the individual using it. Please read the How to.

19 hours ago, kirispupis said: Search by catalog reference. Right now this is a basic string search and mostly useless.

The search by catalog reference is easily achievable with the proximity search operator as described in the other thread.

19 hours ago, kirispupis said: Be smart enough to know similar catalog references. Many coins are listed by multiple services. If I search for one, understand that the others may match too.

That would be a nice feature. It's basically a thesaurus for catalog references. I do see some potential problems, though (e.g. one catalog reference may be broader [cover multiple types] than another and, thus, lead to a less effective search), so this would need to be an option you can turn on/off. As an alternative, search for one reference to see what other references there are and then use the group operator to search for all references you found at once.

19 hours ago, kirispupis said: Understand debated attributions. For example, coins from Myos and Myndos often have identical catalog references (such as SNG Copenhagen 1022). Therefore, if I search for Myos I should receive Myndos and the other cities these are often attributed to.

I agree, it would be nice if the search knew all this and automatically searched for both. I even think a conditional thesaurus wouldn't be too hard to implement. The only problem is that all this knowledge would have to be fed in first, for all numismatic areas. As an alternative, you can achieve the same with the group operator, if you have the knowledge.

19 hours ago, kirispupis said: Include highlights for known fakes. Right now these are often removed by auction houses but remain in search results. It would be great to keep them in the results, but put a red border or other designation on them.
This could be crucial when searching for coins.

The problem is that fakes are usually removed because of certain legal implications. And if they haven't been removed before the data enters the archives, I imagine that an auction house won't be uber happy if we put a big red border around the fakes they had in their auctions. In the end, it's their data and they decide if it stays in the archive or not. There's a fine line.

17 hours ago, Ed Snible said: When I signed up for acsearch, more than 10 years ago, I clicked to agree to the terms. I assume the terms I agreed to then are similar to the current terms, which say "2.1 Any intervention in the function or manipulation of the website is forbidden to the highest degree. This in particular includes ... the use of so called web scrapers, web robots and other software for the systematic collection of data and content ..." Without the data it is hard to create and test software to analyze the data! I have never asked Simon Wieland for permission to access his data. I don't have time to work on this project seriously. Perhaps someone could approach him to see if he would be willing to sell programmatic access to his non-price data to companies and non-profits doing research and new product design.

The data of acsearch is what makes it valuable and, thus, is not sold or given away in its entirety. But I believe you won't need 10 million auction records. I am happy to provide a good sample (I believe 1% is more than enough) to anyone who is interested.

17 hours ago, kirispupis said: My hope with the suggestions is that the owners of ACSearch or CoinArchives pick them up, or someone else does. It really has to be a labor of love, since there aren't enough ancient coin collectors to justify the expense.

I think that's the main problem. I like many of your ideas, but who is going to put in all the work? If anyone is willing to get their hands dirty and put in some actual work (e.g.
by manually labeling data to train an AI model), let me know. I am the first one who will support you 🙂

Edited November 22, 2022 by SimonW
Benefactor kirispupis Posted November 22, 2022 · Benefactor Benefactor Posted November 22, 2022 12 minutes ago, SimonW said: I think that's the main problem. I like many of your ideas, but who is going to put in the all the work? If anyone is willing to get their hands dirty and put in some actual work (e.g. by manually labeling data to train an AI model), let me know. I am the first one who will support you 🙂 Fair enough. FWIW, I've implemented several production searches in my career. The largest one was basically a contact search where individuals could use address, office number, phone number (or just the extension), name (either just the first name, last name, both, or last name first), and numerous other fields. Total database size was in the hundreds of millions of records and was queried several million times a day. In that one, data entry was also non-uniform and a type of state machine was used to normalize data and try to figure out what the user is asking for before we performed the search, which was basically a ranked index search (though we had several indexes for different searches). Note that, in my experience, users shouldn't be expected to understand the format of the data they're looking for. I believe it's reasonable for a user to state that they're entering a catalog reference or date range, but expecting the types of searches listed elsewhere generally results in lost customers. In our case, there was just one search field and we had to figure out what they wanted first. Given that this was a mission-critical system for many Fortune 500 companies, I heard about it if the results weren't correct. I haven't done much with neural networks, short of taking some classes on it, though I have several friends who run large teams dedicated to AI and I'm sure would be happy to answer my questions. Compared to the problems they're solving, this certainly doesn't seem to be too complex. That being said, my biggest limitation is time. 
I have a number of projects in the queue, including finishing two novels, writing an application to learn foreign languages, and redoing my coin website. I certainly understand that yours is limited as well. Catalog search isn't the world's most complex computing problem, but it's also going to take some time to realize. Note that, if you have the time to implement one relatively easy feature that would make ACSearch far more useful, consider pulling the dates out of the listings. This shouldn't be too complicated and would provide a trivial way to limit search results. For me, lately I've been focusing on Greek cities from a specific time period, and it's annoying when I search for a city to understand how frequent its types are and the frequency of each type, and the results are cluttered with Roman provincials. Perhaps some time later I'll have more free time to help you put my suggestions into action.
Ed Snible Posted November 22, 2022 · Member

3 hours ago, kirispupis said: Note that, if you have the time to implement one relatively easy feature that would make ACSearch far more useful, consider pulling the dates out of the listings. This shouldn't be too complicated and would provide a trivial way to limit search results. For me, lately I've been focusing on Greek cities from a specific time period, and it's annoying when I search for a city to understand how frequent its types are and the frequency of each type, and the results are cluttered with Roman provincials.

Back in 2016 I was volunteering for coinproject.com by writing a program that could read freeform auction record text from JSON files and understand it enough to create the kinds of metadata records CoinProject was using for its search feature. It was able to get about 98% understanding of the coin descriptions by applying "regular expressions" to human text written by the catalogers at Gitbud and Solidus. For example, for Roman Imperial I was getting the emperor and his reign using this:

'^(?P<issuer>[A-Z][a-z]+ [A-Z][a-z]+) \((?P<daterange>[0-9]+-[0-9]+)\)\.', # Emp Emp (123-456)
'^(?P<issuer>[A-Z][a-z]+ [A-Za-z]+)\. \((?P<daterange>[0-9]+-[0-9]+) ?\)\.', # Emp. (123-456 )
'^(?P<issuer>[A-Z]+) \((?P<daterange>[0-9]+ BC-AD [0-9]+)\)\.', # EMP (123 BC-AD 456)
'^[A-Z ]+\. Minted under (?P<issuer>[A-Z ]+) \((?P<daterange>[0-9]+-[0-9]+)\)\.', # EMP. Minted under EMP (123-456).
'^(?P<issuer>[A-Z ]+) \((?P<daterange>[0-9]+-[0-9]+ AD)\)\.', # EMP (123-456 AD).
'^(?P<issuer>[A-Z ]+) \((?P<daterange>[0-9]+)\)\.', # EMP (123).
'^(?P<issuer>[A-Z ]+) \(Died (?P<daterange>[0-9]+)\)\.', # EMP (Died 123).
'^(?P<issuer>[A-Z ]+)\. \(Died (?P<daterange>[0-9]+)\)\.', # EMP. (Died 123).
'^(?P<issuer>[A-Z ]+) \(Died AD (?P<daterange>[0-9]+)\)\.', # EMP (Died AD 123).
'^(?P<issuer>[A-Z ]+) \(AD (?P<daterange>[0-9]+[ \?]?- ?[0-9]+)\)\.', # EMP (AD 123-456).
'^[A-Z ]+: (?P<issuer>[A-Z ]+) \((?P<daterange>[0-9]+-[0-9]+) AD\)\.', # EMP: EMP (123-456 AD).
'^(?P<issuer>[A-Z ]+)\. Wife of [A-Z][a-z]+\. \((?P<daterange>[0-9]+-[0-9]+)\)\.', # EMP. Wife of EMP. (123-456).
'^[A-Z ]+\([0-9]+-[0-9]+\)\. Minted under (?P<issuer>[A-Z ]+) \((?P<daterange>[0-9]+-[0-9]+)\)\.', # EMP (123-456). Minted under EMP (123-456).

I had other patterns to extract Greek cities, denominations, and issue years. Most of the time was figuring out the other 2%. Today I would just focus on the 98% I understood, and perhaps later use the programmatically extracted data to train an AI to guess at the others. It was boring working on this all by myself so I gave up. Perhaps today there is a large enough community of people who like both computers and coins that more progress could be made, with the work spread over more people?
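For readers who want to try the patterns from the post above, here is a minimal Python sketch of how such a list might be applied. It is not Ed's original program. Three of the quoted patterns are copied verbatim; the function name `extract_issuer_and_dates` and the sample descriptions are made up for illustration and are not real auction records.

```python
import re

# Three of the patterns quoted in the post above, copied verbatim.
PATTERNS = [
    r'^(?P<issuer>[A-Z][a-z]+ [A-Z][a-z]+) \((?P<daterange>[0-9]+-[0-9]+)\)\.',
    r'^(?P<issuer>[A-Z]+) \((?P<daterange>[0-9]+ BC-AD [0-9]+)\)\.',
    r'^(?P<issuer>[A-Z ]+) \(AD (?P<daterange>[0-9]+[ \?]?- ?[0-9]+)\)\.',
]
COMPILED = [re.compile(p) for p in PATTERNS]


def extract_issuer_and_dates(description):
    """Return the named groups of the first pattern that matches, else None.

    Hypothetical helper: the name and return shape are assumptions.
    """
    for pattern in COMPILED:
        match = pattern.match(description)
        if match:
            return match.groupdict()
    return None


# Made-up sample descriptions, written only to exercise the patterns.
samples = [
    "Antoninus Pius (138-161). AR Denarius. Rome.",
    "AUGUSTUS (27 BC-AD 14). AR Denarius.",
    "MARCUS AURELIUS (AD 161-180). AE Sestertius.",
    "Something the patterns do not cover.",
]

for text in samples:
    print(extract_issuer_and_dates(text))
```

Descriptions that no pattern covers come back as `None`, which makes it easy to collect the leftover records for manual inspection.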
SimonW Posted November 23, 2022 · Member Author Posted November 23, 2022 (edited) 17 hours ago, kirispupis said: FWIW, I've implemented several production searches in my career. The largest one was basically a contact search where individuals could use address, office number, phone number (or just the extension), name (either just the first name, last name, both, or last name first), and numerous other fields. Total database size was in the hundreds of millions of records and was queried several million times a day. In that one, data entry was also non-uniform and a type of state machine was used to normalize data and try to figure out what the user is asking for before we performed the search, which was basically a ranked index search (though we had several indexes for different searches). Note that, in my experience, users shouldn't be expected to understand the format of the data they're looking for. I believe it's reasonable for a user to state that they're entering a catalog reference or date range, but expecting the types of searches listed elsewhere generally results in lost customers. In our case, there was just one search field and we had to figure out what they wanted first. Given that this was a mission-critical system for many Fortune 500 companies, I heard about it if the results weren't correct. I have no doubt that you have the expertise. The problem, however, is that we don't have the same resources that your company must have had building a system critical for many other companies. For this reason, my call to let me know if anyone has not only good ideas, but also isn't afraid of helping with the work it takes to implement them. 17 hours ago, kirispupis said: For me, lately I've been focusing on Greek cities from a specific time period, and it's annoying when I search for a city to understand how frequent its types are and the frequency of each type, and the results are cluttered with Roman provincials. Can you provide an example? 
I am sure that you can improve the results by optimizing your search term using the right operators.

Edited November 23, 2022 by SimonW
SimonW Posted November 23, 2022 · Member Author Posted November 23, 2022 (edited) 11 hours ago, Ed Snible said: Back in 2016 I was volunteering for coinproject.com by writing a program that could read freeform auction record text from JSON files and understand it enough to create the kinds of metadata records CoinProject was using for its search feature. It was able to get about 98% understanding of the coin descriptions by applying "regular expressions" to human text written by the catalogers at Gitbud and Solidus. For example, for Roman Imperial I was getting the emperor and his reign using this: '^(?P<issuer>[A-Z][a-z]+ [A-Z][a-z]+) \((?P<daterange>[0-9]+-[0-9]+)\)\.', # Emp Emp (123-456) '^(?P<issuer>[A-Z][a-z]+ [A-Za-z]+)\. \((?P<daterange>[0-9]+-[0-9]+) ?\)\.', # Emp. (123-456 ) '^(?P<issuer>[A-Z]+) \((?P<daterange>[0-9]+ BC-AD [0-9]+)\)\.', # EMP (123 BC-AD 456) '^[A-Z ]+\. Minted under (?P<issuer>[A-Z ]+) \((?P<daterange>[0-9]+-[0-9]+)\)\.', # EMP. Minted under EMP (123-456). '^(?P<issuer>[A-Z ]+) \((?P<daterange>[0-9]+-[0-9]+ AD)\)\.', # EMP (123-456 AD). '^(?P<issuer>[A-Z ]+) \((?P<daterange>[0-9]+)\)\.', # EMP (123). '^(?P<issuer>[A-Z ]+) \(Died (?P<daterange>[0-9]+)\)\.', # EMP (Died 123). '^(?P<issuer>[A-Z ]+)\. \(Died (?P<daterange>[0-9]+)\)\.', # EMP. (Died 123). '^(?P<issuer>[A-Z ]+) \(Died AD (?P<daterange>[0-9]+)\)\.', # EMP (Died AD 123). '^(?P<issuer>[A-Z ]+) \(AD (?P<daterange>[0-9]+[ \?]?- ?[0-9]+)\)\.', # EMP (AD 123-456). '^[A-Z ]+: (?P<issuer>[A-Z ]+) \((?P<daterange>[0-9]+-[0-9]+) AD\)\.', # EMP: EMP (123-456 AD). '^(?P<issuer>[A-Z ]+)\. Wife of [A-Z][a-z]+\. \((?P<daterange>[0-9]+-[0-9]+)\)\.', # EMP. Wife of EMP. (123-456). '^[A-Z ]+\([0-9]+-[0-9]+\)\. Minted under (?P<issuer>[A-Z ]+) \((?P<daterange>[0-9]+-[0-9]+)\)\.', # EMP (123-456). Minted under EMP (123-456). I had other patterns to extract Greek cities, denominations, and issue years. Most of the time was figuring out the other 2%. 
Today I would just focus on the 98% I understood, and perhaps later use the programmatically extracted data to train an AI to guess at the others. Thank you very much @Ed Snible for providing your regular expressions. The problem is that I don't have time to write individual regular expressions for each company, language and numismatic area, which would take weeks and weeks of work alone. I would need one (or no more than a few) that is robust enough to cover at least 95% of all 10 million records. A well trained ML/AI model would most probably be more effective and robust. Edited November 23, 2022 by SimonW 1 Quote
Ed Snible Posted November 23, 2022 · Member

5 hours ago, SimonW said: The problem is that I don't have time to write individual regular expressions for each company, language and numismatic area, which would take weeks and weeks of work alone.

I’m not trying to offer a commercial solution. My hope is to inspire a student or independent scholar to do the work. Once the work is done, there is perhaps a nice paper to publish in a journal for the student. “Extracting metadata from sales catalogs in the antiquities trade” or some such.

To get a well trained AI model some training data is needed. Tens of thousands of examples where the correct answer is known. I’m describing an approach to get that training data. Here is the approach I recommend.

First, obtain data. The easiest way is to accept @SimonW’s offer of 1% of the auction descriptions. That would currently be 101,288 descriptions.

Next, manually come up with a “regular expression” to match the data you want in the first record. Say dates. For example, the first record for "Hadrian" in ACSearch is

Hadrian AR Denarius. Rome, AD 132-134. HADRIANVS AVGVSTVS, bare head left / INDVLGENTIA AVG PP COS III, Indulgentia seated left, extending right hand and holding sceptre. RIC 213; C. 846, 850. 3.20g, 18mm, 6h. Very Fine.

The pattern is very simple: “a ruler name, a metal, a denomination, a period, a city, AD, a year, a dash, another year.” Perhaps 1% of the data in ACSearch follows that pattern.

Find the records that match the pattern. Remove those records. Repeat until there are just a thousand oddball records left. I found the records that didn’t match the most common patterns were usually typos.

My claim is that with a few days of work it is possible to create regular expression patterns to match 90% of the metadata in sales catalogs. ACSearch has over ten million records, so that would be correct metadata for 9 million of them. Correct metadata for 9 million coins would be valuable!
It is easy to get to 95% with a bit more work. Not just the years and rulers, but the inscriptions, grades, weights, and catalog references. The same patterns can be applied to the OCRed text from early catalogs available from archive.org (@rNumis does this interest you?). Six years ago Alfred De La Fe and I could not find anyone else interested in this effort. This approach is only useful if the result can be sold or given to someone who already has an ancient coin search engine, or for generating data to train an AI, or for an academic paper. Perhaps today there are more people who can use the output, and more people skilled in creating regular expression matchers?
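The match, remove, repeat loop described in this post could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the pattern `RULER_MINT_AD` is one possible reading of "a ruler name, a metal, a denomination, a period, a city, AD, a year, a dash, another year", the metal abbreviations AR/AE/AV are assumed, and `peel` is a hypothetical helper name. The first record is the Hadrian description quoted in the post; the second is made up to show a leftover.

```python
import re

# Hypothetical pattern for the shape described in the post:
# ruler, metal, denomination, period, city, "AD", year, dash, year.
# The metal abbreviations (AR, AE, AV) are an assumption.
RULER_MINT_AD = re.compile(
    r'^(?P<ruler>[A-Z][a-z]+) (?P<metal>AR|AE|AV) (?P<denomination>[A-Z][a-z]+)\. '
    r'(?P<mint>[A-Z][a-z]+), AD (?P<start>[0-9]+)-(?P<end>[0-9]+)\.'
)


def peel(records, patterns):
    """Split records into (extracted metadata, leftover text).

    Each record is assigned to the first pattern that matches it;
    records that match nothing are kept for the next round.
    """
    extracted, leftover = [], []
    for text in records:
        for pattern in patterns:
            match = pattern.match(text)
            if match:
                extracted.append(match.groupdict())
                break
        else:
            leftover.append(text)
    return extracted, leftover


records = [
    # The Hadrian description quoted in the post.
    "Hadrian AR Denarius. Rome, AD 132-134. HADRIANVS AVGVSTVS, bare head left / "
    "INDVLGENTIA AVG PP COS III, Indulgentia seated left, extending right hand and "
    "holding sceptre. RIC 213; C. 846, 850. 3.20g, 18mm, 6h. Very Fine.",
    # Made-up record that this pattern does not cover.
    "TROAS, Alexandria Troas. Circa 3rd century BC. AE 18mm.",
]

extracted, leftover = peel(records, [RULER_MINT_AD])
print(extracted)
print(f"coverage: {len(extracted)}/{len(records)}, still to do: {len(leftover)}")
```

In practice one would look at `leftover`, write a pattern for its most common shape, append it to the list, and run `peel` again until only the oddball records remain.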
SimonW Posted November 23, 2022 · Member Author Please correct me if I am wrong, but if you use regular expressions to get data to train an ML/AI model, the model will only (or mainly) work for records that follow the same regular expressions. Amazon had that exact problem with their facial detection app where they used mainly faces of white people to train the model. The model then, unsurprisingly, only worked well for white people. I would use a pre-trained model (e.g. GPT-3) on a few thousand random records, verify/correct them manually and then train a custom model. More difficult than extracting the data is normalizing it (different languages, spelling, notation, etc.) so that it becomes useful.
Benefactor kirispupis Posted November 23, 2022 · Benefactor Benefactor Posted November 23, 2022 7 hours ago, SimonW said: Can you provide an example? I am sure that you can improve the results by optimizing your search term using the right operators. Two recent examples are "Alexandria Troas" and "Ilion". Keep in mind, when I search, I don't initially care about prices. I care how rare the issues are from a city between 350-250 BCE. At this point, I'm trying to decide how important it is to bid, so I want to see things like the following (which ACSearch tells me if the query works): How many have sold (how rare is it)? What is the condition of the one I'm looking at vs those that have sold? Have there been recent sales? Who is selling them? This tells me how hard I should go after this coin. For prices, the following is what I typically do (no offense): Search at CoinArchives, which gives only the last 6 months - mainly if the coin is common. For these coins I find that historical prices aren't as useful. Search at the auction that listed the coin, if they have such a search. Different auction houses have different clientele, and some coins are in more or less demand. For very rare coins, I'll use ACSearch to determine what they've gone for, but I can usually look at the type and condition and have an idea I rarely search for a particular type, but care more about issues from the city in general, unless this type has an obverse I know will be popular. Quote
SimonW Posted November 23, 2022 · Member Author (edited)

42 minutes ago, kirispupis said: (no offense)

None taken 🙂

42 minutes ago, kirispupis said: Two recent examples are "Alexandria Troas" and "Ilion".

If you only care about the Greek coins and not about the Roman provincial issues, use the following search term:

Alexandria Troas ("bc" "b.c." "bce" "b.c.e." "v. chr." "avant j." "a.c." "ac") -(colon pseudo autonom "ad" "a.d." "n. chr.")

Use the city name you are interested in and add two search groups, one positive, one negative. The positive group contains all terms of which at least one should be present in a result. The negative group contains all words that must not be present in a result (all of them would be associated with Roman provincial coins). Use quotation marks where you want to have an exact match and no quotation marks when the term is either unique or may continue (the wildcard is added to all terms that aren't in quotation marks by default). Your results may still contain a few provincial issues, but most will be Greek coins. Here are the results:

Alexandria Troas (2168 results)

Alexandria Troas ("bc" "b.c." "bce" "b.c.e." "v. chr." "avant j." "a.c." "ac") -(colon pseudo autonom "ad" "a.d." "n. chr.") (313 results)

Edited November 23, 2022 by SimonW
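To make the group logic in this post concrete, here is a small Python sketch that imitates it on local text. It is not how acsearch is implemented; it only mirrors the rules as stated in the post: at least one positive term, no negative terms, quoted terms match exactly, unquoted terms match as a prefix. Treating the city words as "all must be present" is an assumption, as are the function names and the two made-up sample descriptions.

```python
import re


def term_regex(term, exact):
    """Quoted (exact) terms must match whole; unquoted terms match as a prefix."""
    tail = r'(?!\w)' if exact else ''
    return re.compile(r'(?<!\w)' + re.escape(term) + tail, re.IGNORECASE)


def matches(text, required, any_of, none_of):
    """Each term is a (string, exact) pair.

    required: every term must occur (assumed behaviour for the plain city words)
    any_of:   at least one term must occur (the positive group)
    none_of:  no term may occur (the negative group)
    """
    def found(term):
        return term_regex(*term).search(text) is not None

    return (all(found(t) for t in required)
            and any(found(t) for t in any_of)
            and not any(found(t) for t in none_of))


# The search term from the post, split into its three parts.
required = [("Alexandria", False), ("Troas", False)]
any_of = [(t, True) for t in
          ["bc", "b.c.", "bce", "b.c.e.", "v. chr.", "avant j.", "a.c.", "ac"]]
none_of = [("colon", False), ("pseudo", False), ("autonom", False),
           ("ad", True), ("a.d.", True), ("n. chr.", True)]

# Made-up descriptions, not real auction records.
greek = "TROAS, Alexandria Troas. Circa 3rd century BC. AE 18mm."
provincial = "TROAS, Alexandria Troas. Augustus, 27 BC-AD 14. AE 20mm."

print(matches(greek, required, any_of, none_of))
print(matches(provincial, required, any_of, none_of))
```

The second sample contains both "BC" and "AD", so the positive group accepts it but the negative group rejects it. That is the same trade-off the post mentions: the filter is a heuristic and a few records will land on the wrong side.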
Benefactor kirispupis Posted November 23, 2022 · Benefactor Benefactor Posted November 23, 2022 3 hours ago, SimonW said: If you only care about the greek coins and not about the roman provincial issues, use the following search term: Alexandria Troas ("bc" "b.c." "bce" "b.c.e." "v. chr." "avant j." "a.c." "ac") -(colon pseudo autonom "ad" "a.d." "n. chr.") Use the city name you are interested in and add two search groups, one positive, one negative. The positive group contains all terms of which at least one should be present in a result. The negative group contains all words that must not be present in a result (all of them would be associated with roman provincial coins). Use quotation marks where you want to have an exact match and no quotation marks when the term is either unique or may continue (the wildcard is added to all terms that aren't in quotation marks by default). Your results may still contain a few provincial issues, but most will be greek coins. Here are the results: Alexandria Troas (2168 results)Alexandria Troas ("bc" "b.c." "bce" "b.c.e." "v. chr." "avant j." "a.c." "ac") -(colon pseudo autonom "ad" "a.d." "n. chr.") (313 results) Thanks Simon, however that's still quite distant from the goal. Most of the results are still irrelevant since there are many 1st and 2nd century BCE coins that I don't care about. That's why, for my uses at least, being able to provide a year range would be huge. In my collection, for all practical purposes, the world ended around 250 BCE. 🙂 One other nice to have would be to filter on material. Sometimes I try searching for "bronze" or "Ae", which helps, but usually doesn't. Quote
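As a rough idea of what the year-range filter requested here might involve, the following Python sketch pulls a date span out of a description and checks it against a range, with BC years stored as negative numbers. It is only a sketch under assumptions: it recognises just four notations ("27 BC-AD 14", "AD 132-134", "132-134 AD", "301-281 BC" or "BCE"), the names `parse_years` and `in_range` are hypothetical, and the sample descriptions are made up. Real listings use many more notations and languages.

```python
import re

# (pattern, sign of first year, sign of second year); BC years become negative.
# Only these four notations are handled; everything else is left unparsed.
DATE_PATTERNS = [
    (re.compile(r'(\d+) BCE?-AD (\d+)'), -1, 1),   # 27 BC-AD 14
    (re.compile(r'AD (\d+)-(\d+)'), 1, 1),         # AD 132-134
    (re.compile(r'(\d+)-(\d+) AD'), 1, 1),         # 132-134 AD
    (re.compile(r'(\d+)-(\d+) BCE?\b'), -1, -1),   # 301-281 BC
]


def parse_years(text):
    """Return (start, end) as signed years, or None if no known notation is found."""
    for pattern, sign_a, sign_b in DATE_PATTERNS:
        match = pattern.search(text)
        if match:
            return sign_a * int(match.group(1)), sign_b * int(match.group(2))
    return None


def in_range(text, earliest, latest):
    """True if the parsed span overlaps [earliest, latest]; unparsed text is excluded."""
    span = parse_years(text)
    if span is None:
        return False
    start, end = span
    return start <= latest and end >= earliest


# Made-up descriptions; the range 350-250 BCE is the one mentioned in the thread.
print(in_range("TROAS, Alexandria Troas. Circa 301-281 BC. AE 18mm.", -350, -250))
print(in_range("Hadrian AR Denarius. Rome, AD 132-134.", -350, -250))
```

Excluding unparsed records is a design choice for the sketch; a real filter would probably need an option to include them, since descriptions such as "3rd century BC" carry no explicit year span.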
SimonW Posted November 23, 2022 · Member Author Well in that case I'd simply use a list of names of the kings, magistrates and/or issuers that are of interest. For example: Alexandria Troas (Lysimachus Antiochus) Or if you are interested in certain denominations (e.g. Tetradrachms), add those too. If you're looking for bronze, use "AE" (including the quotation marks). This all helps to narrow down your search. Once you get used to the search syntax, it's quite easy to get exactly what you're looking for 😉 Nonetheless, thank you very much for your input. The features you suggested would certainly be nice to have and I'll see what I can do.